Optimality of Inference in Hierarchical Coding for Distributed Object-Based Representations


Simon Brodeur, Jean Rouat
NECOTIS, Département de génie électrique et de génie informatique, Université de Sherbrooke, Sherbrooke, Canada
Email: simon.brodeur@usherbrooke.ca, jean.rouat@usherbrooke.ca

Abstract—Hierarchical approaches for representation learning have the ability to encode relevant features at multiple scales or levels of abstraction. However, most hierarchical approaches exploit only the last level in the hierarchy, or provide a multiscale representation that holds a significant amount of redundancy. We argue that removing redundancy across the multiple levels of abstraction is important for an efficient representation of compositionality in object-based representations. With the perspective of feature learning as a data compression operation, we propose a new greedy inference algorithm for hierarchical sparse coding. Convolutional matching pursuit with an L0-norm constraint was used to encode the input signal into compact and non-redundant codes distributed across the levels of the hierarchy. Simple and complex synthetic datasets of temporal signals were created to evaluate the encoding efficiency and to compare it with the theoretical lower bounds on the information rate for those signals. Empirical evidence shows that the algorithm is able to infer near-optimal codes for simple signals. However, it failed for complex signals with strong overlapping between objects. We explain the inefficiency of convolutional matching pursuit that occurred in such cases. This brings new insights about the NP-hard optimization problem related to using an L0-norm constraint to infer optimally compact and distributed object-based representations.

I. INTRODUCTION

Unsupervised learning is very attractive today because of the large amount of unannotated data available from social media and from sensors in intelligent devices (e.g. smartphones, autonomous cars, home automation systems). Manifold learning (embedding) is currently one of the most popular forms of learning a latent, low-dimensional feature map that describes some temporal or spatial properties of the input signal (for a review, see [1]). There is, however, very little work on evaluating the quality of those representations based on compression criteria. The goal of this paper is to investigate how the notion of data compression, from both theoretical and empirical perspectives, can help us design, compare and better understand the learning of embeddings that efficiently represent the input structures. This work is in line with clustering by compression (e.g. [2]) and minimum description length (MDL) approaches (e.g. [3], [4]) to feature learning and coding, which contrast with discriminative approaches.

Object-based representations are very efficient for encoding spatially or temporally independent structures of a signal, and such structures are in fact very common in the real world (e.g. pulse-resonance sounds in audio, visual composition in images, syntactic items in language). Sparse coding (SC) approaches are well adapted to model such kinds of signals (e.g. [14]). We suggest that dictionary-based feature learning and representation with sparse coding could be comparable to general lossless data compression algorithms such as LZ77 [5]. Those algorithms achieve compression by finding redundant sequences in the data, storing a single copy in a static or sliding-window dictionary, and then using a reference to this copy to replace the multiple occurrences in the data.
This indexing scheme of the dictionary allows the representation to be sparse and compact, meaning that a single index (or reference to a dictionary element) describes a larger segment of the input sequence. Most prior work on dictionary learning and optimal data compression considers string sequences with discrete symbols (e.g. text documents, DNA sequences), as they are primarily used for file compression (e.g. [5]). We instead consider sequences of continuous values, and propose a new greedy inference algorithm for hierarchical sparse coding. In this work, we focus on the inference process, i.e. finding the optimal sets of sparse coefficients given the signal and the learned dictionaries. The goal is to evaluate how the hierarchical and compression frameworks can improve object-based representations.

II. SPARSE CODING

Assume a simple linear decomposition S = XD. The variable S defines a set of M signals of dimensionality N, i.e. S = [s_1, ..., s_M] ∈ R^(M×N). The variable D defines a dictionary of K bases of dimensionality N, i.e. D = [d_1, ..., d_K] ∈ R^(K×N). The result is a set of coefficients X = [x_1, ..., x_M] ∈ R^(M×K) that represents the input S. Sparse coding (SC) aims at providing a set of sparse coefficients by penalizing the L1-norm of X, as shown in Equation 1 (for a review, see [6]). The constant α controls the tradeoff between the reconstruction error and the sparsity of the coefficients.

(X^{*}, D^{*}) = \arg\min_{X,D} \frac{1}{2} \| S - XD \|_2^2 + \alpha \| X \|_1    (1)
with constraint \| d_k \|_2 = 1 for k ∈ {1, 2, ..., K}

Sparse coding compacts the energy of the representation into a few high-valued coefficients while letting the remaining coefficients sit near zero. Compression is achieved by limiting the amount of information that needs to be stored to represent those coefficients (e.g. by neglecting most low-valued coefficients).
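To make Equation 1 concrete, the following is a minimal sketch (not part of the original paper) of the L1-penalized objective with a fixed dictionary D, optimized by an ISTA-style proximal gradient update; the step size eta and the soft-thresholding helper are illustrative assumptions only, and differ from the L0-constrained matching pursuit used later in this paper.

import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||z||_1 (element-wise shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista_step(S, X, D, alpha, eta):
    """One ISTA update for min_X 0.5*||S - X D||_2^2 + alpha*||X||_1.

    S: (M, N) input signals, X: (M, K) coefficients, D: (K, N) dictionary
    with unit-norm rows, alpha: sparsity weight, eta: gradient step size.
    """
    R = S - X @ D                                   # reconstruction residual
    return soft_threshold(X + eta * (R @ D.T), eta * alpha)

# Tiny usage example with random data (illustration only).
rng = np.random.default_rng(0)
M, N, K = 4, 64, 16
D = rng.standard_normal((K, N))
D /= np.linalg.norm(D, axis=1, keepdims=True)       # unit-norm bases
S = rng.standard_normal((M, N))
X = np.zeros((M, K))
for _ in range(100):
    X = ista_step(S, X, D, alpha=0.1, eta=0.05)
objective = 0.5 * np.sum((S - X @ D) ** 2) + 0.1 * np.abs(X).sum()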

Sparse coding can be applied to sequences of variable length (e.g. temporal signals) by changing the matrix multiplication operator into a discrete convolution, as shown in Equation 2. Note that the definition of convolution from machine learning is used, where the filter is not time-reversed; this convolution is equal to a cross-correlation. The convolution adds time-shift (translation) invariance properties to the dictionary D, and D ∈ R^(K×W) can now have a temporal length W < N to represent local features in the signal. Moreover, the direct penalization of the L0-norm of X can be used instead. This is better adapted to compression, since only the information about the non-zero coefficients needs to be stored. The drawback of the L0-norm is that it makes the optimization problem NP-hard [7].

(X^{*}, D^{*}) = \arg\min_{X,D} \frac{1}{2} \| S - X * D \|_2^2 + \alpha \| X \|_0    (2)
with constraint \| d_k \|_2 = 1 for k ∈ {1, 2, ..., K}

Learning representations and encoding the input with a too restricted temporal length W does not allow learning higher-level objects that derive from the composition of multiple local features. This degrades the ability to compress information. Thus, we propose to efficiently incorporate a hierarchical aspect into the convolutional sparse coding framework.
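As a concrete illustration of inference under Equation 2 with a fixed dictionary, here is a minimal single-channel convolutional matching pursuit sketch (a simplification under our own assumptions, not the paper's exact implementation): it greedily selects the most correlated atom and time position, subtracts its contribution, and stops once a target reconstruction SNR is reached.

import numpy as np

def conv_matching_pursuit(s, D, snr_thres_db=40.0, max_iter=1000):
    """Greedy L0 inference for s ≈ sum_i c_i * shift(D[k_i], t_i).

    s: (N,) input signal, D: (K, W) dictionary of unit-norm atoms.
    Returns a list of (k, t, coefficient) activations and the residual.
    """
    r = s.astype(float).copy()
    energy = np.sum(s ** 2)
    activations = []
    for _ in range(max_iter):
        snr = 10.0 * np.log10(energy / max(np.sum(r ** 2), 1e-12))
        if snr >= snr_thres_db:
            break
        # Cross-correlate the residual with every atom (ML convolution convention).
        C = np.array([np.correlate(r, d, mode='valid') for d in D])  # (K, N-W+1)
        k, t = np.unravel_index(np.argmax(np.abs(C)), C.shape)
        coeff = C[k, t]                        # optimal amplitude for a unit-norm atom
        activations.append((k, t, coeff))
        r[t:t + D.shape[1]] -= coeff * D[k]    # remove the atom's contribution
    return activations, r

With unit-norm atoms, the correlation value itself is the least-squares optimal coefficient for the selected atom, so no extra projection step is needed in this sketch.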
Fig. 1. Greedy level-wise inference scheme for distributed representations in hierarchical sparse coding. When the input S is convolved with the dictionary D_1 at level 1 and greedy inference (shown as white dots) is applied to select sparse activations in the feature maps, the first-level sparse representation X_1^std is created. X_1^std then becomes the input signal for the next level. During inference, the second level can choose to combine several lower-level activations, as modeled by the dictionary D_2, and output information in the higher-level feature maps X_2^std, or to forward particular sparse activations as-is through the passthrough feature maps X_2^pass if the sparse activations in X_1^std do not correlate strongly with D_2. The same process is repeated for the next levels in the hierarchy. From the output of the top level, both the passthrough and the higher-level feature maps describe the input signal in a distributed way using the non-redundant sets of coefficients {X_1, X_2, X_3}. The passthrough feature maps also allow learning over multiple scales. Note that the widths of the dictionaries increase over the hierarchy, so that the considered temporal context grows at each level and thus allows encoding longer time dependencies between sparse activations.

III. PROPOSED HIERARCHICAL SPARSE CODING

Many of the successes in machine learning that led to the sub-field of deep learning are about hierarchical processing. This means stacking several levels of representations or processing layers in a hierarchy to build an abstract representation of the input that is useful for a particular task (e.g. classification, control). However, many approaches to representation learning (e.g. [1], [8]) only consider the output of the last level and thus lose crucial information about the compositionality of objects. This means that fine information is mixed together with the higher-level, coarse information about objects. It also scales inefficiently with the context, as the number of possible objects typically increases at an exponential rate when the temporal or physical context is made larger. Some approaches (e.g. [9]) aggregate encodings at multiple independent scales prior to classification, but disregard the redundancy across the scales.

An efficient hierarchical processing means encoding objects at the level at which they naturally appear, rather than re-encoding them to propagate the information up in the hierarchy. This is similar to the multiresolution decomposition scheme of the discrete wavelet transform [10], except that hierarchical sparse coding is not constrained to time-frequency domain analysis. This avoids the tradeoff between time and frequency resolutions. Hierarchical sparse coding also contrasts with other methods such as multistage vector quantization (e.g. [11]), which encode information maximally at each level and only pass the residual information (i.e. the reconstruction error) to the next level in the hierarchy. These methods do not have the ability to learn high-level objects, since the objects are encoded into their most basic constituents at the very first levels, and the residual energy decreases rapidly towards zero. This is because low-level constituents are typically easier to learn and encode.

The proposed method for hierarchical processing in a sparse coding framework is described in Equation 3. In this hierarchical decomposition scheme, the reconstruction is based on the sets of coefficients {X_1, X_2, ..., X_L} from L levels, with linear composition across levels based on their respective dictionaries {D_1, D_2, ..., D_L}. The sets of coefficients at all levels do not share redundant information, which makes the representation compact in terms of data compression. Note that the dictionaries are now multidimensional arrays (tensors) of rank 3, i.e. D_ℓ ∈ R^(K_ℓ × W_ℓ × K_{ℓ−1}). The dictionary at level ℓ defines a set of K_ℓ bases (or convolution filters) of dimensionality W_ℓ × K_{ℓ−1}, where W_ℓ is the temporal length of the bases and K_{ℓ−1} is the number of features from the previous level. The variable S now defines a (temporal) sequence of length N_t, with feature vectors of dimensionality N_s, i.e. S = [s_1, ..., s_{N_t}] ∈ R^(N_t × N_s). The set of coefficients X_ℓ at level ℓ is a multidimensional array of rank 3, i.e. X_ℓ ∈ R^(N_t × K_{ℓ−1} × K_ℓ), also called the feature maps.

Θ = {X_1, X_2, ..., X_L, D_1, D_2, ..., D_L}

\Theta^{*} = \arg\min_{\Theta} \frac{1}{2} \left\| S - \sum_{\ell=1}^{L} \left( X_\ell * \left( \circledast_{k=1}^{\ell} D_k \right) \right) \right\|_2^2 + \sum_{\ell=1}^{L} \alpha_\ell \| X_\ell \|_0    (3)

with constraint \| d_{k,\ell} \|_2 = 1 for k ∈ {1, 2, ..., K_ℓ}, ℓ ∈ {1, 2, ..., L}, and with the operator ⊛ defined as a composition of convolutions: \circledast_{\ell=1}^{L} D_\ell = D_1 * D_2 * \cdots * D_L.

A. Greedy Inference for Hierarchical Sparse Coding

The hierarchical inference scheme is illustrated in Figure 1. We extended convolutional matching pursuit (e.g. [14]) and applied it to the hierarchy in a greedy level-wise manner. At higher levels of the hierarchy (i.e. ℓ > 1), we introduce two types of feature maps, defined by X_ℓ^pass and X_ℓ^std: passthrough feature maps forwarded from the previous levels, and standard feature maps obtained directly at level ℓ from the decomposition with dictionary D_ℓ. The passthrough feature maps allow each level to forward to the next level the sparse activations that do not correlate well with the bases of dictionary D_ℓ. This can be implemented by considering during matching pursuit a second dictionary D_ℓ^pass ∈ R^(K_{ℓ−1}^pass × W_ℓ × K_{ℓ−1}^pass) containing a single Kronecker delta function per dictionary basis, so that S * D_ℓ^pass = S. This corresponds to copying specific sparse activations from the input to the output feature maps X_ℓ^pass. K_{ℓ−1}^pass = \sum_{i=1}^{\ell-1} K_i is the number of aggregated feature maps (both passthrough and standard) from the previous levels. To allow learning features over multiple scales, the dictionary D_ℓ at each level can also cover the passthrough feature maps of the previous level, thus increasing its dimensionality to D_ℓ ∈ R^(K_ℓ × W_ℓ × (K_{ℓ−1}^pass + K_{ℓ−1})). The hierarchical sparse coding inference algorithm SOLVE-HSC is described below. Note that more sophisticated matching pursuit methods can be integrated in SOLVE-MP, such as low-complexity orthogonal matching pursuit (LoCOMP) [13]. The constant β allows controlling the allocation of coefficients to the passthrough feature maps.

function SOLVE-HSC(S, {D_1, D_2, ..., D_L}, β)
    ▷ Solve for the coefficients at the first level (no passthrough).
    ▷ Note that X_1^pass will be an empty coefficient set.
    X_1^pass, X_1^std ← SOLVE-MP(S, D_1, 0)
    ▷ Iterate over all higher levels (with passthrough).
    for ℓ = 2 to L do
        ▷ Concatenate passthrough and standard feature maps, then solve with the matching pursuit algorithm.
        X_ℓ^pass, X_ℓ^std ← SOLVE-MP(X_{ℓ−1}^pass ⊕ X_{ℓ−1}^std, D_ℓ, β)
    end for
    ▷ Extract the output coefficient sets from level L.
    X_1, X_2, ..., X_{L−1} ← SPLIT(X_L^pass)
    X_L ← X_L^std
    ▷ Calculate the overall residual on the input.
    R ← S − \sum_{ℓ=1}^{L} ( X_ℓ * ( ⊛_{k=1}^{ℓ} D_k ) )
    return {X_1, X_2, ..., X_L}, R
end function

function SOLVE-MP(S, D, β)
    ▷ Create a passthrough dictionary such that S * D^pass = S.
    D^pass ← Kronecker delta functions
    R_1 ← S            ▷ Initialize the residual.
    X^pass, X^std ← ∅  ▷ Empty coefficient sets.
    n ← 1
    ▷ Iterate over the residual until convergence at the given SNR.
    while 10 log10( ‖S‖_F^2 / ‖R_n‖_F^2 ) < SNR_thres do
        ▷ Calculate the convolutions with both dictionaries.
        C^pass, C^std ← R_n * D^pass, R_n * D
        ▷ Select the best activated basis for each dictionary, at the respective dictionary index k and center position t.
        d^pass, d^std ← arg max C^pass, arg max C^std
        ▷ Compare the coefficients of the two selected bases.
        if β C^pass > C^std then
            X^pass ← X^pass ∪ {C^pass}          ▷ Add the coefficient to the passthrough set.
            R_{n+1} ← R_n − C^pass * d_k^pass   ▷ Remove the contribution of the selected basis from the residual.
        else
            X^std ← X^std ∪ {C^std}             ▷ Add the coefficient to the standard set.
            R_{n+1} ← R_n − C^std * d_k^std     ▷ Remove the contribution of the selected basis from the residual.
        end if
        n ← n + 1
    end while
    return {X^pass, X^std}
end function
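To make the passthrough selection step concrete, below is a minimal NumPy sketch of a SOLVE-MP-like routine for one level (an illustrative reconstruction under our own assumptions, not the authors' code): the input is a multichannel sequence, the standard dictionary is a bank of multichannel filters, and each iteration either copies a single input activation into the passthrough maps or subtracts a standard atom, depending on the β-weighted comparison.

import numpy as np

def solve_mp_passthrough(S, D, beta, snr_thres_db=40.0, max_iter=1000):
    """Greedy matching pursuit with a passthrough (Kronecker delta) dictionary.

    S: (T, C) input feature maps (time x channels).
    D: (K, W, C) standard dictionary of unit-norm multichannel atoms.
    Returns X_pass (T, C) and X_std (T-W+1, K) sparse coefficient maps.
    """
    K, W, C = D.shape
    T = S.shape[0]
    R = S.astype(float).copy()
    X_pass = np.zeros((T, C))
    X_std = np.zeros((T - W + 1, K))
    energy = np.sum(S ** 2)
    for _ in range(max_iter):
        snr = 10.0 * np.log10(energy / max(np.sum(R ** 2), 1e-12))
        if snr >= snr_thres_db:
            break
        # Passthrough correlations: a delta dictionary just reproduces the residual.
        c_pass = R
        # Standard correlations over all atoms and valid time positions.
        c_std = np.empty((T - W + 1, K))
        for k in range(K):
            for t in range(T - W + 1):
                c_std[t, k] = np.sum(R[t:t + W] * D[k])
        # Best candidate from each dictionary.
        tp, cp = np.unravel_index(np.argmax(np.abs(c_pass)), c_pass.shape)
        ts, ks = np.unravel_index(np.argmax(np.abs(c_std)), c_std.shape)
        # The beta-weighted comparison decides passthrough vs. standard encoding.
        if beta * np.abs(c_pass[tp, cp]) > np.abs(c_std[ts, ks]):
            coeff = c_pass[tp, cp]
            X_pass[tp, cp] += coeff
            R[tp, cp] -= coeff                 # delta atom: remove a single sample
        else:
            coeff = c_std[ts, ks]
            X_std[ts, ks] += coeff
            R[ts:ts + W] -= coeff * D[ks]
    return X_pass, X_std

A level-wise driver in the spirit of SOLVE-HSC would then concatenate X_pass and X_std along the channel axis and feed the result to the next level's dictionary.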
IV. EXPERIMENT

A. Dataset Generation

Multiple synthetic datasets were created to evaluate the hierarchical encoding performance in terms of compression. The first dataset, named dataset-simple, has no overlap between objects in the decomposition and in the generated signal when sparse activations are composed in time. The decomposition depends only on the previous level, and thus the dictionaries do not use the passthrough feature maps. The second dataset, named dataset-complex, has no such constraints on overlap, and is useful to assess the encoding of compositionality at multiple scales with a complexity closer to real-world signals.

The first-level dictionary bases were generated using 1D Perlin noise [12]. Hanning windowing was used to make the bases localized in time and to avoid sharp discontinuities when overlaps occur. Higher-level dictionary bases were generated by uniformly sampling a fixed number N_d^ℓ of activated sub-bases from the previous level, where N_d^ℓ corresponds to the decomposition size at level ℓ. The relative random positions of the selected sub-bases were uniformly sampled in the time domain within the defined temporal length. The relative random coefficients of the selected sub-bases were uniformly sampled from the interval [−2, 2].
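The following sketch illustrates this generation procedure under simplifying assumptions of our own (smoothed Gaussian noise stands in for Perlin noise, and the shapes are arbitrary): first-level bases are Hanning-windowed, unit-norm noise segments, and a higher-level basis is assembled from N_d randomly placed, randomly weighted sub-bases of the level below.

import numpy as np

def first_level_bases(K, W, rng):
    """K smooth, Hanning-windowed, unit-norm bases of length W.

    Smoothed white noise is used here as a stand-in for 1D Perlin noise.
    """
    bases = []
    for _ in range(K):
        noise = np.convolve(rng.standard_normal(W), np.ones(8) / 8.0, mode='same')
        b = noise * np.hanning(W)
        bases.append(b / np.linalg.norm(b))
    return np.array(bases)                          # (K, W)

def compose_level(prev_bases, K, W, N_d, rng):
    """K higher-level bases of length W, each composed of N_d sub-bases."""
    bases = np.zeros((K, W))
    W_prev = prev_bases.shape[1]
    for k in range(K):
        for _ in range(N_d):
            sub = prev_bases[rng.integers(len(prev_bases))]
            t = rng.integers(W - W_prev + 1)        # relative random position
            coeff = rng.uniform(-2.0, 2.0)          # relative random coefficient
            bases[k, t:t + W_prev] += coeff * sub
        bases[k] /= np.linalg.norm(bases[k])
    return bases

# Example hierarchy loosely following the dataset specifications.
rng = np.random.default_rng(0)
D1 = first_level_bases(K=4, W=16, rng=rng)
D2 = compose_level(D1, K=8, W=64, N_d=2, rng=rng)
D3 = compose_level(D2, K=16, W=128, N_d=2, rng=rng)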

TABLE I
SPECIFICATIONS OF THE GENERATED DATASETS

                                 L1    L2    L3    L4
dataset-simple
  Input-level scales:            16    64    128   256
  Dictionary sizes (K):          4     8     16    32
  Decomposition sizes (N_d):     -     2     2     3
dataset-complex
  Input-level scales:            32    64    128   256
  Dictionary sizes (K):          4     8     16    32
  Decomposition sizes (N_d):     -     4     3     3

Fig. 2. Example of dictionary bases at each of the 4 levels of the hierarchy used to generate the datasets: (a) dictionaries for dataset-simple, (b) dictionaries for dataset-complex. The bold bars on the right indicate the temporal span of each basis (input-level scale), in samples. As the level increases, the bases become more complex and abstract.

Table I gives the specifications of the dictionaries for the two datasets, where the input-level scales, dictionary sizes and decomposition sizes vary across levels. Figure 2 shows some bases of the datasets. For dataset-simple, the temporal signal was generated by concatenating the input-level representations of randomly chosen bases across all levels (with equal probabilities), since there is no overlapping. For dataset-complex, the temporal sequence was generated from the dictionaries using independent homogeneous Poisson processes to sample sparse activations (events) in the sets of coefficients {X_1, X_2, X_3, X_4}. Each dictionary basis was assigned a Poisson process with a fixed rate λ = 9 × 10^−4 (in units of events/sample), chosen so that the generated signal was neither temporally too compact nor too sparse. In both datasets, random coefficients for the sparse activations were uniformly sampled in the interval [2, 4.0]. The length of the generated signals was 100,000 samples, with roughly 650 activations distributed across the various levels. Figure 3 shows segments of the datasets. We implemented the hierarchical sparse coding algorithm and dataset generation in Python using the Numpy and Scipy libraries.

Fig. 3. Example of segments from the datasets: (a) segment for dataset-simple, (b) segment for dataset-complex. When averaged over a longer time scale, the information rate needed to encode each signal remains constant.
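As an illustration of this generation process for dataset-complex, here is a small sketch (our own single-channel simplification, with hypothetical waveforms and the amplitude range taken from the description above) that draws homogeneous Poisson activations for each basis and superimposes the corresponding input-level waveforms.

import numpy as np

def synthesize_signal(bases, rate, n_samples, rng, amp_range=(2.0, 4.0)):
    """Superimpose Poisson-sampled activations of input-level bases.

    bases: list of 1-D arrays (the input-level representation of each basis),
    rate: expected activations per sample for each basis (homogeneous Poisson),
    amp_range: uniform range for the random activation coefficients.
    """
    signal = np.zeros(n_samples)
    events = []                                    # (basis index, time, amplitude)
    for k, b in enumerate(bases):
        n_events = rng.poisson(rate * n_samples)   # homogeneous Poisson process
        times = rng.integers(0, n_samples - len(b) + 1, size=n_events)
        for t in times:
            a = rng.uniform(*amp_range)
            signal[t:t + len(b)] += a * b
            events.append((k, int(t), a))
    return signal, events

# Usage with stand-in waveforms of the four input-level scales.
rng = np.random.default_rng(1)
bases = [np.hanning(w) for w in (16, 64, 128, 256)]
signal, events = synthesize_signal(bases, rate=9e-4, n_samples=100_000, rng=rng)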
I opt = =1 L =+1 X 0 N }{{ t } Activation ( rate at eve I activation at eve ) N X h 0 d N t h =+1 Activation rate transfered from higher-eve h to eve based on decomposition size N h d with I = (og L + og K + og N t ) Indexing bits for eve, basis and time I act max = =1 X 0 N t Activation rate at eve + (4) I max activation at eve I activation at eve + I fp Ampitude bits Figure 4 shows that the weighting factor β can indeed contro the distribution of non-zero coefficients amongst eves. Figure 5 shows that for a hierarchica inference task on dataset-simpe, the framework achieves performance cose to the ower bound of the information rate. It means that mutipe eves of the hierarchy are used effectivey assuming the proper choice of β. If β is too ow, a sparse activations wi be reconstructed from the higher-eve dictionaries, which may not be adapted and thus significanty degrade compression (5)

Fig. 4. Distribution of non-zero coefficients inferred across levels for dataset-simple, as a function of the weighting factor β. Lower β values (e.g. β = 0.75) increased the percentage of higher-level activations. When β was too high (e.g. β = 2), everything was eventually encoded by the first level only.

Fig. 5. Optimality of the hierarchical inference on the datasets as a function of the maximum level considered. On dataset-simple (left), an intermediate weighting factor achieved the best I^act performance, near the lower bound defined by I^opt. Too low a weighting (β = 0.75) leads to less efficient encoding with additional levels. Too high a weighting (β = 2.00) leads to no decrease in the information rate I^act with additional levels. On dataset-complex (right), the inference failed to find an efficient encoding for all values of β because of the limitations of convolutional matching pursuit.

If β is too high, the input will be encoded by the first level only and propagated up in the hierarchy only through the passthrough feature maps. The same analysis on dataset-complex led to very different results when we tested the limits of convolutional matching pursuit, indicating that it can struggle on complex signals when overlapping between objects occurs frequently. For all values of β tested, we observed an increase in the information rate as more levels are considered at the output of the encoder. It was observed that using L0-norm regularization during inference adds noise to the residual that needs to be removed later, and thus requires a significant number of bits to encode. We also observed similar results (not shown) when using the LoCOMP [13] variant of matching pursuit. Allowing dictionary learning in the optimization process (see Equation 3) could solve this problem, as the higher-level dictionaries can be adapted to the noise introduced at inference by matching pursuit.

V. CONCLUSION

In this work, we proposed a hierarchical sparse coding algorithm to efficiently encode an input signal into compact and non-redundant codes distributed across the levels of the hierarchy. The main idea is to represent features at the abstraction level at which they naturally appear, rather than re-encoding them through the higher-level features of the hierarchy. With the proposed method, it is possible to empirically achieve near-optimal representations in terms of signal compression, as indicated by the information rate needed to transmit the output sparse activations. The experiments highlighted the important role of the greedy inference algorithm at each level. Poor inference, such as allocating too many coefficients because of interference effects, directly impacts the information rate, as seen with convolutional matching pursuit. This effect was observed on complex signals, where the compression ability vanished because of the interference-related variability. Future work should aim at solving this problem, which is primarily due to the use of L0-norm regularization.

ACKNOWLEDGMENT

The authors would like to thank the ERA-NET (CHIST-ERA) and FRQNT organizations for funding this research as part of the European IGLU project.
REFERENCES

[1] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798-1828, 2013.
[2] R. Cilibrasi and P. Vitányi, "Clustering by compression," IEEE Transactions on Information Theory, vol. 51, no. 4, pp. 1523-1545, 2005.
[3] A. Barron, J. Rissanen, and B. Yu, "The minimum description length principle in coding and modeling," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2743-2760, 1998.
[4] I. Ramirez and G. Sapiro, "An MDL framework for sparse coding and dictionary learning," IEEE Transactions on Signal Processing, vol. 60, no. 6, pp. 2913-2927, 2012.
[5] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337-343, 1977.
[6] R. Rubinstein, A. M. Bruckstein, and M. Elad, "Dictionaries for sparse representation modeling," Proceedings of the IEEE, vol. 98, no. 6, pp. 1045-1057, 2010.
[7] B. K. Natarajan, "Sparse approximate solutions to linear systems," SIAM Journal on Computing, vol. 24, no. 2, pp. 227-234, 1995.
[8] L. Bo, X. Ren, and D. Fox, "Multipath sparse coding using hierarchical matching pursuit," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 660-667, 2013.
[9] K. Yu, Y. Lin, and J. Lafferty, "Learning image representations from the pixel level via hierarchical sparse coding," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 1713-1720, 2011.
[10] S. Mallat, "Multiresolution representations and wavelets," Doctoral Dissertation, University of Pennsylvania, 1988.
[11] B.-H. Juang and A. Gray, "Multiple stage vector quantization for speech coding," in IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 597-600, 1982.
[12] K. Perlin, "Improving noise," ACM Transactions on Graphics, vol. 21, no. 3, 2002.
[13] B. Mailhé, R. Gribonval, F. Bimbot, and P. Vandergheynst, "A low complexity orthogonal matching pursuit for sparse signal approximation with shift-invariant dictionaries," in IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3445-3448, 2009.
[14] M. S. Lewicki and T. J. Sejnowski, "Coding time-varying signals using sparse, shift-invariant representations," in Advances in Neural Information Processing Systems, pp. 730-736, 1999.