Depth versus Breadth in Convolutional Polar Codes

Maxime Tremblay,1 Benjamin Bourassa1 and David Poulin1,2
1 Département de physique & Institut quantique, Université de Sherbrooke, Sherbrooke, Québec, Canada J1K 2R1
2 Canadian Institute for Advanced Research, Toronto, Ontario, Canada M5G 1Z8
maxime.tremblay9@usherbrooke.ca

arXiv:1805.09306v1 [cs.IT] 23 May 2018

Abstract. Polar codes were introduced in 2009 by Arikan as the first efficient encoding and decoding scheme that is capacity achieving for symmetric binary-input memoryless channels. Recently, this code family was extended by replacing the block-structured polarization step of polar codes by a convolutional structure. This article presents a numerical exploration of this so-called convolutional polar code family to find efficient generalizations of polar codes, both in terms of decoding speed and decoding error probability. The main conclusion drawn from our study is that increasing the convolution depth is more efficient than increasing the polarization kernel's breadth as previously explored.

I. INTRODUCTION

Polar codes build on channel polarization to efficiently achieve the capacity of symmetric channels (refer to [1], [2], [3], [4] for detailed presentations). Channel polarization is a method that takes two independent binary-input discrete memoryless channels W(y|x) to a bad channel and a good channel, given by

  W^-(y_1^2 | u_1) = \sum_{u_2 \in \{0,1\}} W(y_2 | u_2) W(y_1 | u_1 \oplus u_2),   (1)

  W^+(y_1^2, u_1 | u_2) = W(y_2 | u_2) W(y_1 | u_1 \oplus u_2),   (2)

respectively, where x_a^b = (x_a, x_{a+1}, ..., x_b). These channels are obtained by combining two copies of W(y|x) with a CNOT gate (u_1, u_2) -> (u_1 \oplus u_2, u_2) and then decoding successively bits u_1 and u_2. That is, output bit u_1 is decoded first, assuming that u_2 is erased; then bit u_2 is decoded taking into account the previously decoded value of u_1. Polar codes are obtained by recursing this process to obtain 2^l different channels from the polarization of 2^{l-1} pairs of channels (Fig. 1a). As the number of polarization steps l goes to infinity, the fraction of channels for which the error probability approaches 0 tends to I(W) and the fraction of channels for which the error probability approaches 1 tends to 1 - I(W), where I(W) is the mutual information of the channel with uniform distribution of the inputs [1]. Thus, polar codes are capacity achieving for those channels.
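To make one polarization step concrete, here is a minimal numpy sketch (ours, not from the paper) that applies Eqs. (1)-(2) to a binary symmetric channel and compares mutual informations; the explicit 1/2 factors, left implicit in Eqs. (1)-(2), account for the uniform prior on the bit that is summed out or revealed.

```python
import numpy as np
from itertools import product

def bsc(p):
    """Binary symmetric channel W(y|x) as an array indexed [y, x]."""
    return np.array([[1 - p, p], [p, 1 - p]])

def polarize(W):
    """One CNOT polarization step, Eqs. (1)-(2): returns the bad channel
    W^-((y1,y2)|u1) and the good channel W^+((y1,y2,u1)|u2) as dicts."""
    Wm, Wp = {}, {}
    for y1, y2 in product(range(2), repeat=2):
        for u1 in range(2):
            Wm[(y1, y2), u1] = sum(0.5 * W[y2, u2] * W[y1, u1 ^ u2]
                                   for u2 in range(2))
        for u1, u2 in product(range(2), repeat=2):
            Wp[(y1, y2, u1), u2] = 0.5 * W[y2, u2] * W[y1, u1 ^ u2]
    return Wm, Wp

def mutual_info(W):
    """I(U;Y) for a uniform input bit U and a channel {(y, u): P(y|u)}."""
    I = 0.0
    for y in {y for y, _ in W}:
        py = 0.5 * W[y, 0] + 0.5 * W[y, 1]
        for u in range(2):
            if W[y, u] > 0:
                I += 0.5 * W[y, u] * np.log2(W[y, u] / py)
    return I

W = bsc(0.25)
W0 = {(y, u): W[y, u] for y in range(2) for u in range(2)}
Wm, Wp = polarize(W)
print(mutual_info(Wm), mutual_info(W0), mutual_info(Wp))
```

For BSC(1/4) this prints roughly 0.046, 0.189 and 0.332: a single step already yields a strictly worse and a strictly better synthesized channel while conserving I(W^-) + I(W^+) = 2 I(W), and the recursion amplifies this split.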
The above construction can be generalized by replacing the CNOT transformation by a different polarization kernel [5]; see Sec. III-A for details. The kernel can generally take as input more than two copies of the channel W(y|x), and the breadth b of a kernel is defined as the number of channels it combines. An increasing breadth offers the possibility of a more efficient polarization (i.e. a lower decoding error probability), but has the drawback of an increased decoding complexity.

Fig. 1. Examples of regular (depth d = 1) and convolutional (depth d > 1) polar code circuits. The parameters (breadth, depth, polarization steps) are (a) (2,1,4), (b) (3,1,3), (c) (2,2,4) and (d) (3,3,3).

Another possible generalization of polar codes is to replace the block-structured polarization procedure by a convolutional structure; see Sec. III-B for details. Note indeed that each polarization step of a polar code consists of independent applications of the polarization kernel on distinct blocks of bits (pairs of bits in the above example with b = 2). Recently ([6], [7]), this idea was extended to a convolutional structure (see Fig. 1c and Fig. 1d), where each polarization step does not factor into a product of independent transformations on disjoint blocks but instead consists of d layers of shifted block transformations. We refer to the number of layers d as the depth of a code. An increasing depth offers the advantage of faster polarization and the drawback of an increased decoding complexity.

The focus of the present work is to compare the trade-off between breadth and depth in terms of the speed at which the decoding error rate goes to zero and of the decoding complexity. We focus on codes of practically relevant sizes using Monte Carlo numerical simulations.

II. DECODING

In this section, the general successive cancellation decoding scheme is defined in terms of tensor networks. This enables a straightforward extension to convolutional polar codes.

A. Successive cancellation

Define G as the reversible encoding circuit acting on N input bits and N output bits. K of these input bits take arbitrary values u_i, while the N - K others are frozen to the value u_i = 0.

Fig. 2. Schematic representation of the successive cancellation decoder. (a) A composite channel is obtained from an encoding circuit G and N copies of a channel W; contracting this tensor network for given y_1^N and u_1^N yields Eq. (3). (b) An effective channel is obtained from the composite channel by summing over all the values of bits u_{i+1}^N, graphically represented by the uniform tensor e = (1 1)^T, when decoding bit u_i; contracting this tensor yields Eq. (4) up to a normalization factor.

From this input u_1^N, the message x_1^N = G u_1^N is transmitted. The channel produces the received message y_1^N, resulting in a composite channel

  W_G(y_1^N | u_1^N) = \prod_{i=1}^{N} W(y_i | (G u_1^N)_i).   (3)

This composite channel induces a correlated distribution on the bits u_i and is represented graphically on Fig. 2a. Successive cancellation decoding converts this composite channel into N different channels given by

  W_G^{(i)}(y_1^N, u_1^{i-1} | u_i) = \sum_{u_{i+1}, ..., u_N} W_G(y_1^N | u_1^N),   (4)

for i = 1, 2, ..., N. Those channels are obtained by decoding successively symbols u_1 through u_N (i.e., from right to left on Fig. 2), summing over all the bits that are not yet decoded and fixing the value of all the bits u_1^{i-1}, either to their frozen value, if the corresponding original input bit was frozen, or to their previously decoded value. This effective channel is represented graphically on Fig. 2b. Given W_G^{(i)}, u_i is decoded by maximizing the likelihood of the acquired information:

  u_i = argmax_{ũ_i \in \{0,1\}} W_G^{(i)}(y_1^N, u_1^{i-1} | ũ_i).   (5)

Applying this procedure for all bits from right to left yields the so-called successive cancellation decoder. Equation (5) can be generalized straightforwardly by decoding not a single bit u_i at a time but instead a w-bit sequence u_i^{i+w-1} jointly, collectively viewed as a single symbol from a larger alphabet of size 2^w. To this effect, the decoding width w is defined as the number of bits that are decoded simultaneously.

B. Decoding with tensor networks

Convolutional polar codes were largely inspired by tensor network methods used in quantum many-body physics (see e.g. [8] and [9] for an introduction). Akin to the graphical tools used in information theory (Tanner graphs, factor graphs, etc.), tensor networks were introduced as compact graphical representations of probability distributions (or amplitudes in quantum mechanics) involving a large number of correlated variables. Moreover, certain computational procedures are more easily cast using these graphical representations. This is the case of the successive cancellation decoding problem described above, where the goal is to compute W_G^{(i)}(y_1^N, u_1^{i-1} | u_i) given fixed values of y_1^N and u_1^{i-1}.

While G is an F_2^N linear transformation, it is sometimes convenient to view it as a linear transformation on the space of probability distributions over N-bit sequences, i.e., the linear space R^{2^N} whose basis vectors are labeled by all possible N-bit strings. On this space, G acts as a permutation matrix mapping basis vector u_1^N to basis vector x_1^N = G u_1^N. A single bit in the state 0 is represented by u = (1 0)^T, in the state 1 by u = (0 1)^T, and a bit string u_1^N is represented by the 2^N-dimensional vector u_1^N = u_1 ⊗ u_2 ⊗ ... ⊗ u_N. A single-bit channel is a 2 × 2 stochastic matrix, and a CNOT gate is given by

  CNOT = ( 1 0 0 0
           0 1 0 0
           0 0 0 1
           0 0 1 0 ),   (6)

because it permutes the inputs 10 and 11 while leaving the other inputs 00 and 01 unchanged. In this representation,

  W_G^{(i)}(y_1^N, u_1^{i-1} | u_i) = (1/Z) [u_1 ⊗ ... ⊗ u_{i-1} ⊗ u_i ⊗ e^{⊗(N-i)}]^T G W^{⊗N} y_1^N,   (7)

where e = (1 1)^T and Z = \sum_{u_i \in \{0,1\}} W_G^{(i)}(y_1^N, u_1^{i-1} | u_i) is a normalization factor.
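As a sanity check on Eqs. (3)-(7), the following self-contained sketch (ours) treats the smallest case, N = 2 with a single CNOT as encoder: W_eff evaluates Eq. (4) by brute-force summation, the two argmax lines are Eq. (5), and the last block reproduces the same number through the R^{2^N} representation of Eq. (7).

```python
import numpy as np
from itertools import product

p = 0.25
W = np.array([[1 - p, p], [p, 1 - p]])      # W(y|x), indexed [y, x]

def encode(u):                               # N = 2 encoder: one CNOT
    return (u[0] ^ u[1], u[1])

def W_G(y, u):                               # composite channel, Eq. (3)
    return np.prod([W[yi, xi] for yi, xi in zip(y, encode(u))])

def W_eff(y, i, prefix):                     # effective channel, Eq. (4)
    out = np.zeros(2)
    for ui in (0, 1):
        for tail in product((0, 1), repeat=len(y) - i):
            out[ui] += W_G(y, tuple(prefix) + (ui,) + tail)
    return out

y = (0, 1)                                   # received message
u1 = int(np.argmax(W_eff(y, 1, ())))         # Eq. (5) for u_1
u2 = int(np.argmax(W_eff(y, 2, (u1,))))      # then u_2 given u_1

# Eq. (7) over R^{2^N}: G is the permutation matrix sending the basis
# vector of u to the basis vector of encode(u); its transpose pulls the
# output-side channel probabilities back to the input side.
G = np.zeros((4, 4))
for u in product((0, 1), repeat=2):
    x = encode(u)
    G[2 * x[0] + x[1], 2 * u[0] + u[1]] = 1.0
e, basis = np.ones(2), np.eye(2)
yvec = np.kron(basis[y[0]], basis[y[1]])
val = np.kron(basis[0], e) @ G.T @ np.kron(W, W).T @ yvec
print(val, W_eff(y, 1, ())[0])               # equal, before dividing by Z
```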
Ignoring normalization, this quantity can be represented graphically as a tensor network (see Fig. 2), where each element of the network is a rank-r tensor, i.e., an element of R^{2^r}. Specifically, a bit u_i is a rank-one tensor, a channel W is a rank-two tensor, and a two-bit gate is a rank-four tensor (two input bits and two output bits). The CNOT gate is obtained by reshaping Eq. (6) into a (2 × 2 × 2 × 2) tensor. In this graphical representation, a rank-r tensor A_{μ_1, μ_2, ..., μ_r} is represented by a degree-r vertex, with one edge associated to each index μ_k. An edge connecting two vertices means that the shared index is summed over,

  \sum_{\mu} A_{\mu_1 \mu} B_{\mu \mu_2} = C_{\mu_1 \mu_2},   (8)

generalizing the notion of vector and matrix products to higher-rank tensors. Tensors can be assembled into a network where edges represent input-output relations just like in an ordinary logical circuit representation. Evaluating Eq. (7) then amounts to summing over edge values. This computational task, named tensor contraction, generally scales exponentially with the tree-width of the tensor network [10].
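Equation (8) and the gate reshaping are one-liners in numpy (ours); einsum index labels play the role of edges, and a repeated label is a contracted edge.

```python
import numpy as np

# Eq. (8): a matrix product is the contraction of the shared edge m.
A, B = np.random.rand(2, 2), np.random.rand(2, 2)
C = np.einsum('im,mj->ij', A, B)
assert np.allclose(C, A @ B)

# The CNOT of Eq. (6) reshaped into a rank-four tensor T[x1, x2, u1, u2]
# with two output edges and two input edges.
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)
T = CNOT.reshape(2, 2, 2, 2)

# Contracting the input edges with basis vectors evaluates the gate:
# feeding u = (1, 0) returns the basis vector of x = (1, 1), i.e. the
# permutation of inputs 10 and 11 noted below Eq. (6).
b = np.eye(2)
print(np.einsum('xyuv,u,v->xy', T, b[1], b[0]))   # [[0. 0.] [0. 1.]]
```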

Fig. 3. Circuit identities. (a) Any permutation acting on the uniform distribution returns the uniform distribution. (b) Any contraction of a permutation and a basis vector x_1^t gives another basis vector y_1^t.

The graphical calculus becomes valuable when using circuit identities that simplify the tensor network. Specifically, these identities encode two simple facts illustrated on Fig. 3: a permutation G acting on the uniform distribution returns the uniform distribution, G e^{⊗t} = e^{⊗t}, and a permutation acting on a basis vector returns another basis vector, G x_1^t = y_1^t. Once these circuit identities are applied to the evaluation of Eq. (7) in the specific case of polar codes, it was shown in [6], [7] that the resulting tensor network is a tree, so it can be efficiently evaluated. Convolutional polar codes were introduced based on the observation that Eq. (7) produces a tensor network of constant tree-width despite not being a tree (see Fig. 4), an observation first made in the context of quantum many-body physics [11], so they can also be decoded efficiently.

III. POLAR CODE GENERALIZATIONS

In this section, two possible generalizations of polar codes are described and their decoding complexity is analyzed.

A. Breadth

Channel polarization can be achieved using various kernels. In fact, as long as a kernel is not a permutation matrix on F_2^b, it achieves a non-trivial polarization transform [5]. The CNOT gate is one such example that acts on two bits. However, a general kernel of breadth b can act on b bits (see Fig. 1b for an illustration with b = 3). An increasing breadth can produce faster polarization, i.e. a decoding error probability which decreases faster with the number of polarization steps. Indeed, in the asymptotic regime, Arikan [1] showed that, provided the code rate is below the symmetric channel capacity and the locations of the frozen bits are chosen optimally, the asymptotic decoding error probability of the polar code under successive cancellation decoding is P_e ∈ O(2^{-N^{1/2}}). A different error scaling exponent, P_e ∈ O(2^{-N^β}), can be achieved from a broader kernel, but breadth 16 is required to asymptotically surpass β = 1/2 [5]. Such a broad polarization kernel has the drawback of a substantially increased decoding complexity.

Arikan [1] showed that the decoding complexity of polar codes is O(N log_2 N). From a tensor network perspective, this complexity can be understood [7] by counting the number of elementary contractions required to evaluate Eq. (7) and by noting that the tensor networks corresponding to Eq. (7) for u_i and for u_{i+1} differ only on a fraction 1/log_2 N of locations, so most intermediate calculations can be recycled and incur no additional cost. As discussed previously, a breadth-b polarization kernel can also be represented as a 2^b × 2^b permutation matrix acting on R^{2^b}. Applying such a matrix to a b-bit probability distribution has complexity 2^b, and this dominates the complexity of each elementary tensor operation of the successive cancellation decoding algorithm. On the other hand, the total number of bits after l polarization steps with a breadth-b polarization kernel is N = b^l, so the overall decoding complexity in this setting is O(2^b N log_b N).

B. Depth

The previous section described a natural generalization of polar codes which uses a broader polarization kernel. A further generalization, first explored in [6], [7], is to use a polarization step whose circuit is composed of b-local gates and has depth d > 1 (see Fig. 1c), which results in a convolutional transformation.
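To fix ideas on the depth generalization, here is a small sketch (ours; boundary gates that would overhang the edge are dropped rather than wrapped around, one of several possible conventions) that applies one depth-d convolutional polarization step built from the b = 2 CNOT kernel, and verifies that the step remains reversible, as any encoding circuit must be.

```python
from itertools import product

def conv_step(u, d):
    """One convolutional polarization step with the b = 2 CNOT kernel
    (u1, u2) -> (u1 ^ u2, u2): d layers of gates, each layer shifted by
    one wire relative to the previous one (c.f. Fig. 1c)."""
    x = list(u)
    for s in range(d):                        # layer s starts on wire s % 2
        for i in range(s % 2, len(x) - 1, 2):
            x[i] ^= x[i + 1]
    return tuple(x)

# On 8 wires with depth 2 (the layers of a CP_{2,2} code), all 2^8
# outputs are distinct, so the step is a permutation of bit strings.
N, d = 8, 2
images = {conv_step(u, d) for u in product((0, 1), repeat=N)}
print(len(images) == 2 ** N)                  # True
```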
A CP_{b,d} code, that is, a convolutional polar code with kernel breadth b and circuit depth d, is defined similarly to a polar code with a kernel of size b, where each polarization step is replaced by a stack of d polarization layers, each shifted relative to the previous layer. Fig. 1c and Fig. 1d illustrate two realizations of convolutional polar codes.

To analyze the decoding complexity, it is useful to introduce the concept of a causal cone. Given a circuit and a w-bit input sequence u_i^{i+w-1}, the associated causal cone is defined as the set of gates, together with the set of edges of this circuit, whose bit value depends on the value of u_i^{i+w-1}. Figure 4 illustrates the causal cone of the sequence u_3^5 for the code CP_{2,2}.

Given a convolutional code's breadth b and depth d, define m(d, b, w) to be the maximum number of gates in the causal cone of any w-bit input sequence of a single polarization step. Because a single convolutional step counts d layers, define m_s(d, b, w) as the number of those gates in the causal cone which are in the s-th layer (counting from the top) of the convolution. For the first layer, we have m_1(d, b, w) = ⌊w/b⌋ + 1. This number can at most increase by one for each layer, i.e., m_{s+1}(d, b, w) ≤ m_s(d, b, w) + 1, leading to a total number of gates in the causal cone of a single polarization step

  m(d, b, w) = \sum_{s=1}^{d} m_s(d, b, w) ≤ d m_1(d, b, w) + d(d-1)/2 = d⌊w/b⌋ + d(d+1)/2.   (9)

Similarly, define the optimal decoding width w*(b, d) as the smallest value of w for which the causal cone of any w-bit sequence after one step of polarization contains at most 2w output bits. Figure 4 illustrates that w* = 3 for a CP_{2,2} code, since any 3 consecutive input bits affect at most 6 consecutive bits after one polarization step. Choosing a decoding width w*(b, d) thus leads to a recursive decoding procedure which is identical at all polarization steps.
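The counting behind Eqs. (9)-(10) can be checked by brute force. The sketch below (ours) grows the cone structurally: every gate touching the cone joins it together with both of its wires, which is the coarse notion that the bounds above control. For CP_{2,2} it recovers at most 5 gates and 6 output wires for any 3-bit window, matching m(2, 2, 3) = 5 and w* = 3 ≤ bd.

```python
def cone_one_step(N, d, window):
    """Causal cone of a set of input wires through one b = 2
    convolutional step of depth d. Returns (gate count, cone wires)."""
    cone, gates = set(window), 0
    for s in range(d):                            # layer s, shifted by s
        hit = [i for i in range(s % 2, N - 1, 2) if cone & {i, i + 1}]
        gates += len(hit)
        for i in hit:
            cone |= {i, i + 1}
    return gates, cone

N, d, w = 32, 2, 3
cones = [cone_one_step(N, d, range(i, i + w)) for i in range(N - w)]
print(max(g for g, _ in cones), max(len(c) for _, c in cones))   # 5 6
```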

Fig. 4. Graphical representation of the causal cone of u_3^5 in the CP_{2,2} code. Only the gates in the shaded region receive inputs that depend on the sequence u_3^5. Similarly, the edges contained in the shaded region represent bits at intermediate polarization steps whose value depends on the sequence u_3^5. This shows that decoding a CP_{2,2} code amounts to contracting a constant tree-width graph. The optimal width is w* = 3, and at most m(2, 2, w*) = 5 gates are involved per polarization step.

Since the bottom layer counts m_d(d, b, w) ≤ ⌊w/b⌋ + d gates, each acting on b bits, we see that there are at most b(⌊w/b⌋ + d) ≤ w + bd output bits in the causal cone of a single polarization step. The optimal decoding width w* is chosen such that this number does not exceed 2w, thus

  w*(b, d) ≤ bd.   (10)

Using this optimal value in Eq. (9) bounds the number of rank-b tensors that are contracted at each polarization layer, and each contraction has complexity 2^b. Here again, only a fraction 1/log_b N of these contractions differ at each step of successive cancellation decoding, leading to an overall decoding complexity

  C_{b,d}(N) = 2^b m(d, b, w*) (N/w*) log_b N ∈ O(2^b d N log_b N).   (11)

Ref. [7] provides analytical arguments that the resulting convolutional polar codes have a larger asymptotic error exponent β > 1/2, and presents numerical results showing a clear performance improvement over standard polar codes at finite code lengths. These advantages come at the cost of a small constant increase in decoding complexity.

IV. SIMULATION RESULTS

Numerical simulations were performed to analyze the performance of codes of breadth and depth up to 4. The breadth-2 kernel used was the CNOT, while the breadth-3 and breadth-4 kernels were

  G_3 = ( 1 1 0        G_4 = ( 1 1 0 0
          0 1 1                0 1 1 0
          0 0 1 ),             0 0 1 1
                               0 0 0 1 ),

where these are given as matrix representations over F_2. It can easily be verified that these transformations are not permutations, so they can in principle be used to polarize [5]. Also, we chose a convolutional structure where each layer of gates is identical but shifted by one bit to the right (from top to bottom), c.f. Fig. 1d. Many other kernels and convolutional structures have been simulated, but those gave the best empirical results.

The encoding circuit G is used to define the code, but the complete definition of a polar code must also specify the set of frozen bits F, i.e. the set of bits that are fixed to u_i = 0 at the input of the encoding circuit G. In general, for a given encoding circuit G, the set F will depend on the channel and is chosen to minimize the error probability under successive cancellation decoding. Here, a simplified channel selection procedure which uses the error detection criterion described in the next section was used. All the simulations presented focus on the binary symmetric memoryless channel.

A. Error detection

Considering an error detection setting enables an important simplification in which the channel selection and code simulation can be performed simultaneously, without sampling. In this setting, a transmission error x_1^N -> y_1^N = x_1^N + e is considered not detected if there exists a non-frozen bit u_i, i ∈ F^c, which is flipped while none of the frozen bits to its right, u_j with j < i and j ∈ F, have been flipped. In other words, an error is considered not detected if its first error location (starting from the right) occurs on a non-frozen bit. Note that this does not correspond to the usual definition of an undetectable error, which would conventionally be defined as an error that affects no frozen locations. By considering only frozen bits to the right of a given location, the notion used here is tailored to the context of a sequential decoder.
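In code, the detection criterion is a one-pass scan (a sketch, ours; it assumes the error has already been pulled back to the input side as the flip pattern f = G^{-1} e):

```python
def undetected(f, frozen):
    """Error-detection criterion of Sec. IV-A: the error goes undetected
    iff its first flipped position, in decoding order, is non-frozen.
    f is the input-side flip pattern, frozen is the set F."""
    for i, fi in enumerate(f):
        if fi:
            return i not in frozen
    return False                 # the all-zero pattern is not an error

# With F = {0, 1}: a flip first landing on bit 2 slips through, while
# one that first lands on bit 1 is caught.
print(undetected((0, 0, 1, 0), {0, 1}), undetected((0, 1, 1, 0), {0, 1}))
```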
Empirically, it was observed that this simple notion is a good proxy to compare the performance of different coding schemes under more common settings.

Denote P_U(i) the probability that the symbol u_i is the first such undetected error. Then, given a frozen bit set F, the probability of an undetected error is P_U = \sum_{i ∈ F^c} P_U(i). This can be evaluated efficiently using the representation of the encoding matrix over R^{2^N} described above. For e ∈ F_2^N, denote P(e) the probability of a bit-flip pattern e, viewed as a vector on R^{2^N}. At the output of the symmetric channels with error probability p, P^T = (1-p, p)^{⊗N}. Then

  P_U(i) = (1-p, p)^{⊗N} G [ (1 0)^{⊗(i-1)} ⊗ (0 1) ⊗ e^{⊗(N-i)} ],   (12)

where here again e = (1 1)^T. In terms of tensor networks, this corresponds to the evaluation of the network of Fig. 2b with u_i = 1 and u_j = 0 for all j < i. Thus, it can be accomplished with the complexity given by Eq. (11). Because the evaluation of Eq. (12) is independent of the set of frozen bits, it can be evaluated for all positions i, selecting the frozen locations as the N - K locations i with the largest values of P_U(i). Then, the total undetected error probability is the sum of P_U(i) over the remaining locations; this is equivalently the sum of the K smallest values of P_U(i).
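For a toy code, P_U(i) can be evaluated by exhaustive enumeration instead of the tensor contraction of Eq. (12). The sketch below (ours) uses the standard polar transform F tensored with itself over F_2 as a stand-in encoder (bit-ordering conventions ignored, so the numbers are illustrative only) and then freezes the N - K worst positions, exactly as described above.

```python
import numpy as np
from itertools import product

p, n = 0.25, 2                         # BSC(1/4), N = 2**n = 4 bits
F = np.array([[1, 1], [0, 1]])         # CNOT kernel as an F_2 matrix
G = F
for _ in range(n - 1):
    G = np.kron(G, F) % 2              # stand-in polar transform over F_2
N = 2 ** n

def PU(i):
    """Probability that u_i is the first flipped input bit: Eq. (12)
    evaluated by brute force rather than by tensor contraction."""
    total = 0.0
    for f in product((0, 1), repeat=N):
        if f[i] == 1 and not any(f[:i]):        # first flip lands on i
            e = G.dot(f) % 2                    # induced output-side flips
            total += np.prod(np.where(e == 1, p, 1 - p))
    return total

pu = np.array([PU(i) for i in range(N)])
frozen = set(np.argsort(pu)[-(N - 2):].tolist())   # freeze worst N-K, K = 2
P_U = sum(pu[i] for i in range(N) if i not in frozen)
print(pu, frozen, P_U)
```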

[Fig. 5 panels: (a) undetected error probability vs. number of bits, for b = 2, 3, 4 and d = 1, 2, 3, 4; (b) undetected error probability vs. number of operations, for (b, l) = (2, 10), (3, 7), (3, 6), (4, 5); (c) bit error rate vs. number of operations, for (b, l) = (2, 8), (3, 5), (4, 4).]

Fig. 5. Numerical simulation results. (a) Undetected error probability under successive cancellation decoding for polar codes (d = 1) and convolutional polar codes (d > 1) for various kernel breadths b, plotted as a function of the code size N = b^l by varying the number of polarization steps l. The channel is BSC(1/4) and the encoding rate is 1/3. (b) Same thing as (a) but plotted as a function of the decoding complexity, c.f. Eq. (11). The number of polarization steps l is chosen in such a way that all codes are roughly of equal size N(b, l) = b^l ≈ 10^3. The dots connected by a line all have the same kernel breadth b but show a progression of depth d = 1, 2, 3, 4, with d = 1 appearing on the left and corresponding to regular polar codes. (c) The bit error rate for a BSC(1/20) with a 1/3 encoding rate, plotted as a function of the decoding complexity. The depth is specified similarly to (b) by the connected dots. The number of polarization steps is chosen to have roughly N ≈ 250 bits.

The results are shown on Fig. 5a for various combinations of kernel breadths b and convolutional depths d. The code rate was 1/3, meaning that the plotted quantity is the sum of the N/3 smallest values of P_U(i). Fig. 5b presents a subset of the same data with parameters b and l resulting in codes of roughly equal size N = b^l ≈ 10^3. This implies that codes with larger breadth use fewer polarization steps. The undetected error probability P_U is then plotted against the decoding complexity, computed from Eq. (11). Notice that increasing the depth is a very efficient way of suppressing errors at a modest complexity increase. In contrast, increasing the breadth actually deteriorates the performance of these finite-size codes and increases the decoding complexity.

B. Error correction

For the binary symmetric channel, the frozen bits were chosen using the error detection procedure described in the previous section. This is not optimal, but it is sufficient for the sake of comparing different code constructions. Then, standard Monte Carlo simulations were performed by transmitting the all-0 codeword, sampling errors, decoding with successive cancellation, and comparing the decoded message. The results are presented in Fig. 5c. The conclusions drawn from the error detection simulations all carry over to this more practically relevant setting.

V. CONCLUSION

We numerically explored a generalization of the polar code family based on a convolutional polarization kernel given by a finite-depth local circuit. On practically relevant code sizes, it was found that these convolutional kernels offer a very interesting error-suppression versus decoding-complexity trade-off compared to previously proposed polar code generalizations using broad kernels. Empirically, no incentive was found to increase both the breadth and the depth: an increasing depth alone offers greater noise suppression at a comparable complexity increase. It will be interesting to see what further gains can be achieved, for instance, from list decoding of convolutional polar codes.

ACKNOWLEDGMENT

This work was supported by Canada's NSERC and Québec's FRQNT. Computations were done using Compute Canada and Calcul Québec clusters.

REFERENCES

[1] E. Arikan, "Channel Polarization: A Method for Constructing Capacity-Achieving Codes for Symmetric Binary-Input Memoryless Channels," IEEE Transactions on Information Theory, vol. 55, no. 7, pp. 3051-3073, Jul. 2009.
[2] E. Arikan and E. Telatar, "On the rate of channel polarization," in 2009 IEEE International Symposium on Information Theory, Jun. 2009, pp. 1493-1495.
[3] E. Şaşoğlu, E. Telatar, and E. Arikan, "Polarization for arbitrary discrete memoryless channels," in 2009 IEEE Information Theory Workshop, Oct. 2009, pp. 144-148.
[4] S. B. Korada and R. L. Urbanke, "Polar Codes are Optimal for Lossy Source Coding," IEEE Transactions on Information Theory, vol. 56, no. 4, pp. 1751-1768, Apr. 2010.
[5] S. B. Korada, E. Şaşoğlu, and R. Urbanke, "Polar Codes: Characterization of Exponent, Bounds, and Constructions," IEEE Transactions on Information Theory, vol. 56, no. 12, pp. 6253-6264, Dec. 2010.
[6] A. J. Ferris and D. Poulin, "Branching MERA codes: A natural extension of classical and quantum polar codes," in 2014 IEEE International Symposium on Information Theory, Jun. 2014, pp. 1081-1085.
[7] A. J. Ferris, C. Hirche, and D. Poulin, "Convolutional Polar Codes," arXiv:1704.00715 [cs, math], Apr. 2017.
[8] R. Orús, "A Practical Introduction to Tensor Networks: Matrix Product States and Projected Entangled Pair States," Annals of Physics, vol. 349, pp. 117-158, Oct. 2014.
[9] J. C. Bridgeman and C. T. Chubb, "Hand-waving and Interpretive Dance: An Introductory Course on Tensor Networks," Journal of Physics A: Mathematical and Theoretical, vol. 50, no. 22, p. 223001, Jun. 2017.
[10] I. Arad and Z. Landau, "Quantum Computation and the Evaluation of Tensor Networks," SIAM Journal on Computing, vol. 39, no. 7, pp. 3089-3121, Jan. 2010.
[11] G. Evenbly and G. Vidal, "Class of Highly Entangled Many-Body States that can be Efficiently Simulated," Physical Review Letters, vol. 112, no. 24, Jun. 2014.