Polar codes for reliable transmission


Polar codes for reliable transmission
Theoretical analysis and applications

Jing Guo
Department of Engineering
University of Cambridge

Supervisor: Prof. Albert Guillén i Fàbregas

This dissertation is submitted for the degree of Doctor of Philosophy.
Trinity Hall, June 2015


I would like to dedicate this thesis to my parents, who provide me with two essential things: life and love.


Declaration

I hereby declare that except where specific reference is made to the work of others, the contents of this dissertation are original and have not been submitted in whole or in part for consideration for any other degree or qualification in this, or any other, university. This dissertation is my own work and contains nothing which is the outcome of work done in collaboration with others, except as specified in the text and Acknowledgements. This dissertation contains fewer than 65,000 words including appendices, bibliography, footnotes, tables and equations, and has fewer than 150 figures.

Jing Guo
June 2015


Acknowledgements

First, I would like to express my sincerest gratitude to my supervisor, Prof. Albert Guillén i Fàbregas. Discussions with him have been very enlightening, and his sharp and rigorous thinking has illuminated my path into research. I have benefited a lot from his attitude towards life and work. I would like to thank him for his belief in me, which has allowed me to pursue research in my own interests, and for his patience and encouragement whenever I got stuck.

It was a great pleasure to study at the University of Cambridge; I have enjoyed my time spent here, both within and outside my research. I would like to thank Prof. Paul H. Siegel at the University of California, San Diego (UCSD) for the collaboration with his group and for inviting me to visit UCSD. I would like to thank all my collaborators, Jossy Sayir, Minghai Qin, Aman Bhatia, and Paul H. Siegel, for their great contributions to my dissertation. I would like to thank Minghai Qin for proofreading this thesis in great detail and providing many useful suggestions that dramatically improved its quality, and Jonathan Scarlett and Jossy Sayir for their help with proofreading.

I would like to thank the staff in Division F, in particular Rachel Fogg, Lorraine Baker, and Phill Richardson, for their kind help in making my life in the department more enjoyable and easier. I would like to thank the secretaries at the Universitat Pompeu Fabra (UPF), Joana Clotet, Vanessa Jiménez, and Beatriz Abad, for their kind help during my stay at UPF.

I would like to thank the current members and alumni of the group: Alfonso Martinez, Gonzalo Vázquez Vilar, Jossy Sayir, Adrià Tauste Camp, Tobias Koch, Alex Alvarado, Li Peng, Taufiq Agustian Asyhari, Jonathan Scarlett, and Seçkin Anıl Yıldırım. I have benefited a lot from the lunch-time discussions and conversations with them. In particular, I learned a lot from Alfonso: his capacious knowledge and deep sympathy towards underprivileged people have had a great influence on me.

I would like to thank all my friends, including Chong Chen, Lan Jiang, Jonathan Scarlett, Xiao Rong, Yingsong Zhang, Zhihan Xu, Bo Zhen, Yining Chen, Wei Wu,

Fei Jin, Fei Wang, Ningjun Jiang, Kai Gu, Li Peng, Xiaoke Yang, Wenjing Yan, Shihui Guo, Xuefei Wu, Minghai Qin, Keqian Yan, Xiaozhi Fu, Mengshi Wang, Ruizhi Liao, Zhouyue Wang, Xinyi Wu, Ling Tan, and many others. I have shared many precious moments with them. I would like to thank Chong Chen, in particular, for always being there and walking me through difficulties, and Minghai Qin for helpful discussions and constant support during the last two years of my PhD.

Finally, I would like to express my deepest gratitude to my parents and my grandparents for their continued love and support throughout all stages of my life.

This thesis was supported in part by the Cambridge Overseas Trust, the Chinese Scholarship Council, and the European Research Council under an ERC grant agreement.

Abstract

Polar codes, as the first provable capacity-achieving codes for binary discrete memoryless channels with low encoding and decoding complexity, have attracted considerable attention recently. We first study the channel polarization phenomenon over arbitrary-input discrete memoryless channels. We show that, under the restriction that the channel has zero zero-error capacity, channel polarization occurs for any input alphabet that forms a group. We also discuss the set of channels to which the virtual channels converge.

We then study a family of polar codes whose frozen set is chosen by discarding the bit-channels for which the mutual information falls below a certain (fixed) threshold. We show that if the threshold, which may depend on the code length, is bounded appropriately, a coding theorem can be proved for the underlying polar code. We also give accurate closed-form upper and lower bounds on the minimum distance of the resulting code when the design channel is a binary erasure channel.

Furthermore, we investigate code constructions for polar codes in the finite-blocklength regime. We propose a concatenation scheme that utilizes the bit-channels that are not fully polarized. Simulation results show that the concatenated polar codes outperform conventional polar codes under belief propagation decoding. Moreover, by tracking how the messages are updated during belief propagation decoding, we propose a log-likelihood-ratio-oriented criterion for selecting the frozen set. This criterion shows advantages over the conventional construction under both soft-output and hard-output decoding.

Finally, we study sphere decoding algorithms for polar codes, which achieve the maximum-likelihood performance. We propose branching conditions that are stricter than those of previous approaches and that significantly reduce the search complexity; simulation results show a two-order-of-magnitude improvement over existing sphere decoding algorithms. Based on the recursive structure of the generator matrix of polar codes, we also propose a bit-reversed decoding order that further reduces the complexity in the low-to-medium SNR regime.


Table of contents

List of figures
List of tables
Nomenclature

1 Introduction
   1.1 Channel model
   1.2 Channel coding
      Channel coding theorem
   1.3 A brief review of existing coding schemes
   1.4 Polar codes
   1.5 Dissertation overview
   1.6 Notations

2 A review of polar codes
   2.1 Preliminaries
   2.2 Channel transformation
      Channel combining
      Channel splitting
      Recursive channel transformation
   2.3 Channel polarization
   2.4 Polar code encoding
   2.5 SC decoding of polar codes
   2.6 Coding theorems
   2.7 Practical code constructions
   2.8 Summary

3 Channel polarization for arbitrary-input DMCs
   3.1 Introduction
      Entropy inequality for prime-input DMCs
   3.2 Preliminaries
      Zero-error capacity
      Basic algebraic structures
   3.3 Entropy inequalities for arbitrary-input DMCs
   3.4 Channel polarization for arbitrary-input DMCs
      Channel polarization over groups
      Channel polarization over monoids
   3.5 Proofs
      Proof of Lemma …
      Proof of Lemma …
      Proof of Lemma …
      Proof of Lemma …
      Proof of Lemma …
      Proof of Theorem …
   3.6 Conclusion

4 Fixed-threshold polar codes
   Introduction
   Preliminaries
   Fixed-threshold construction
   A coding theorem
   Minimum distance
   Proofs
      Proof of Lemma …
      Proof of Theorem …
      Proof of Theorem …
   Conclusion

5 Belief propagation decoding of polar codes through concatenation
   Introduction
   Factor graph
      Message updating rules
      Factor graph of standard polar codes
   Construction of concatenated polar codes
      Code construction and factor graph representation
   BP decoding of concatenated polar codes
      Flooding BP
      SCAN BP
      Early termination check
   Simulation results
      Channel ordering
      Outer LDPC code
      Outer convolutional code
      Results
   Conclusion

6 Polar code constructions adapted for BP+ and SCL decoders
   Introduction
   Analysis of LLRs with fixed iterations under BP+ decoding
   Simulation results of LLR-oriented constructions under BP+ decoding
      BP+ decoding
      BP+ decoding with guessing
   LLR-oriented construction under SCL decoding
      SCL decoding
      Weight enumerating functions
      Numerical results
   Conclusion

7 Efficient sphere decoding of polar codes
   Introduction
   Modulation and channel model
   Sphere decoding algorithms
      Sphere decoding with fixed lower bounds
      Simulations
      Sphere decoding with dynamic lower bounds
      Simulation results
   Sphere decoding with bit-permute orders
      Definitions and properties of bit-permute orders
      An example of bit-permute orders
      Analysis on bit-permute orders
   Conclusion

8 Conclusions and future work

References

List of figures

1.1 The basic communication system
1.2 Channel polarization for a BEC with ǫ = …
2.1 Vector channel W_2
2.2 Vector channel W_4
2.3 Vector channel W_N
2.4 Encoding procedure for an (8, 4) polar code, BEC with ǫ = …
2.5 A degrading operation
3.1 One step channel transformation
3.2 Channels belonging to C*_0
3.3 Setup conditioned on U_1 = …
3.4 Channel described by case …
3.5 Channel described by case …
4.1 Thresholds of Arıkan's fixed-rate polar codes for R = 0.3, 0.4 and fixed-threshold polar codes with threshold θ_N over the BEC with I(W) = …
4.2 Rate convergence for different threshold functions θ_N over a BEC with I(W) = …
4.3 Minimum distance of fixed-threshold polar codes with θ_N = 1 - 1/N - 2^{-N^β} with β = 2 over the BEC
4.4 Minimum distance d_min for Arıkan's fixed-rate polar codes with R = 0.35, 0.49 and for a fixed-threshold polar code with threshold function θ_N = 1 - 2^{-N^β} with β = 2, for a BEC with I(W) = …
5.1 Factor graph of a (7, 4) Hamming code
5.2 Message propagation between CNs and VNs
5.3 Factor graph of a standard polar code of length N = …
5.4 Thresholds for sorted Bhattacharyya parameters Z(W_N^{(i)}). The underlying channel is an AWGN channel with E_b/N_0 = 0 dB; blocklength N = …
5.5 Factor graph of concatenated polar codes of length N = …
5.6 FER of BP decoding with two schedulings (flooding and SCAN) over AWGN channels. All codes have length N = 4096 and overall rate R = 1/2. The concatenated LDPC code is a (3,5)-regular Tanner code
5.7 Average number of iterations for early-termination-enabled BP+ decoding, (a) flooding, (b) SCAN. All codes have length N = 4096 and rate R = 1/2
5.8 FER comparison of fixed-iteration BP decoding and early-termination-enabled BP+ (flooding). All codes have length N = 4096 and overall rate R = 1/2. The concatenated LDPC code is a (3,5)-regular Tanner code
5.9 FER comparison of fixed-iteration BP decoding and early-termination-enabled BP+ (SCAN). All codes have length N = 4096 and overall rate R = 1/2. The concatenated LDPC code is a (3,5)-regular Tanner code
5.10 FER and BER comparison of different schemes: 1) early-termination-enabled BP (flooding); 2) early-termination-enabled BP (flooding) with a concatenated Tanner code; 3) successive cancellation (SC) decoding; 4) early-termination-enabled BP (SCAN); 5) early-termination-enabled BP (SCAN) with a concatenated Tanner code; 6) early-termination-enabled BP (SCAN) with a concatenated convolutional code. All codes have length N = 4096 and overall rate R = 1/2. The polar code is constructed based on [75] at E_b/N_0 = 0 dB
5.11 Average number of iterations for early-termination-enabled BP+ decoding, (a) flooding and (b) SCAN. All codes have length N = 4096 and rate R = 1/2
5.12 FER and BER comparison of different concatenation schemes. All decoders are based on early-termination-enabled BP+ (SCAN) decoding over the AWGN channel at SNR = 4 dB. All codes have length N = 4096 but unequal rate
6.1 FER of polar codes and RM codes over AWGN channels under SCL decoding with list size L = 128. The codes have blocklength N = 128 and rate R = 1/2. The polar code is constructed based on [75] at E_b/N_0 = 3 dB
6.2 LLR analysis of BP (SCAN) over the AWGN channel at E_b/N_0 = 2.25 dB of a polar code with N = 1024 and R = 1/2. The polar code is constructed based on [75] at E_b/N_0 = 2.25 dB
6.3 LLR analysis of BP (SCAN) over the AWGN channel at E_b/N_0 = 2.25 dB using an LLR-oriented construction obtained by swapping 12 bit-channels of the conventional construction [75]
6.4 FER of BP (SCAN) over AWGN channels with conventional and LLR-oriented constructions. All codes have length N = 1024, and the rate is R = 0.5 and 0.86, respectively
6.5 FER comparison of BP+ (SCAN) with guessing over AWGN channels. Both codes have length N = 4096 and rate R = …
6.6 Breadth-first search of a decoding tree with list size L = …
6.7 Union bounds and FER comparison of LLR-oriented and conventional constructions of polar codes over AWGN channels. Both codes have length N = 1024 and rate R = 1/2. The decoder is an SCL decoder of list size …
6.8 FER comparison of LLR-oriented and conventional constructions of polar codes with SCL decoders over AWGN channels. All codes have length N = 1024 and rate R = 1/2. The conventional construction is based on [75] optimized at E_b/N_0 = 2.25 dB. The SCL decoders have list sizes 16 and 32. The CRC-8 is an 8-bit CRC from [74]
6.9 FER comparison of LLR-oriented and conventional constructions of polar codes with CRC-aided SCL decoding over AWGN channels. Both codes have length N = 1024 and rate R = …. The SCL decoder has list size 64. The CRC-8 is an 8-bit CRC from [74]
6.10 Union bounds and FER comparison of LLR-oriented and conventional constructions of polar codes with CRC-aided SCL decoding over AWGN channels. Both codes have length N = 1024 and R = …
7.1 Average number of nodes visited for the (64, 57) RM code
7.2 Error rate performance for the (64, 57) RM code
7.3 Average number of nodes visited for (64, K) RM codes with R = K/64 over AWGN channels with E_b/N_0 = 6 dB
7.4 Average number of nodes visited for (N, RN) polar codes with R = 0.89 over an AWGN channel with E_b/N_0 = 8 dB
7.5 Average number of nodes visited for (64, K) RM codes over AWGN channels with E_b/N_0 = 6 dB
7.6 Average number of nodes visited for a (64, 32) RM code for two decoding orders (fixed lower bounds applied) over AWGN channels: 1) natural order; 2) bit-reverse order
7.7 Average number of nodes visited for a (64, 32) polar code over AWGN channels; four decoding orders of maximum and minimum sums are shown, compared to the average over all n! decoding orders

List of tables

6.1 Number of weight-16 codewords for polar codes with N = 1024 and R = 1/2 with conventional and LLR-oriented constructions


Nomenclature

Roman Symbols
C      Channel capacity
I      Mutual information
N      Blocklength
O      Order notation for sequences

Other Symbols
R      Set of real numbers
X      Input alphabet
Y      Output alphabet
Z      Set of integers
Z^+    Set of positive integers

Acronyms / Abbreviations
AWGN   Additive white Gaussian noise
B-DMC  Binary discrete memoryless channel
BEC    Binary erasure channel
BER    Bit error rate
BP     Belief propagation
BSC    Binary symmetric channel
CRC    Cyclic redundancy check
DMC    Discrete memoryless channel
FER    Frame error rate
i.i.d. Independent and identically distributed
LDPC   Low-density parity-check
LLR    Log-likelihood ratio
ML     Maximum-likelihood
SCL    Successive cancellation list
SC     Successive cancellation
SD     Sphere decoding

Chapter 1
Introduction

One of the most important questions in information theory is at what rate information bits can be reliably transmitted over a noisy channel. In real-world communication systems, data transmission is inevitably affected by noise. In order to recover the original data after transmission in a noisy environment, redundancy needs to be added to the data before transmission. The procedure of adding redundancy is called channel coding, and it is essential in communication systems. An efficient coding scheme adds redundancy in such a way that the coding rate is high while the probability of error is kept low.

1.1 Channel model

The mathematical framework of a communication system is shown in Fig. 1.1. A message m to be transmitted is mapped into a data sequence x by the encoder. The transmitted sequence x is corrupted by noise during transmission, so the received sequence y differs from x. The output of the decoder, m̂, is an estimate of the transmitted message m based on the received sequence y.

Fig. 1.1 The basic communication system: m → Encoder → x → Channel (with noise) → y → Decoder → m̂.

The channel is the physical medium over which data is transmitted. It can be modeled

by a transition probability function W(y|x) that depends on the communication environment: W(y|x) is the probability of observing y when the sequence x is sent. Throughout the thesis, we will focus on discrete memoryless channels (DMCs), defined as follows.

Definition [44] A discrete memoryless channel (DMC), denoted by W : X → Y, consists of two finite sets X and Y and a collection of probability mass functions W(y|x), one for each x ∈ X, such that for every x and y, W(y|x) ≥ 0, and for every x, \sum_y W(y|x) = 1, with the interpretation that x is the input and y is the output of the channel. The probability distribution of the output at a given time depends only on the input at that time and is conditionally independent of previous channel inputs or outputs; i.e., for k channel uses, the probability of observing y_k when x_1^k is sent and y_1^{k-1} has been observed is

W(y_k | x_1^k, y_1^{k-1}) = W(y_k | x_k),   k = 1, 2, ...   (1.1)

Furthermore, if a DMC is used without feedback, we have

W(y_1^k | x_1^k) = \prod_{i=1}^{k} W(y_i | x_i),   k = 1, 2, ...   (1.2)

Throughout this thesis, DMCs are used without feedback.

The transition matrix of a DMC describes the transition probabilities between the input and output alphabets: the entry in the ith row and jth column is the conditional probability that the jth element of Y is received when the ith element of X is sent. A DMC can be classified by the symmetry¹ of its transition matrix.

Definition A DMC is said to be symmetric if any two rows of its transition matrix are permutations of each other, and any two columns are permutations of each other. For example, a DMC with transition matrix

W(y|x) = [ 0.3  0.2  0.5
           0.5  0.3  0.2
           0.2  0.5  0.3 ]   (1.3)

is a symmetric DMC.

¹ There are various definitions of symmetry, but we will follow the definition in [14, Chapter 8].

Definition A DMC is said to be weakly symmetric if any two rows of its transition matrix are permutations of each other, and each column sums to the same value. For example, a DMC with transition matrix

W(y|x) = [ 1/3  1/6  1/2
           1/3  1/2  1/6 ]   (1.4)

is a weakly symmetric DMC.

A DMC can also be classified by the cardinality of its input alphabet.

Definition If the input alphabet of a DMC has only two elements, the channel is said to be a binary-input discrete memoryless channel (B-DMC).

Definition If the cardinality of the input alphabet of a DMC is a prime number, the channel is said to be a prime-input discrete memoryless channel (prime-input DMC).

The mutual information provides a measure of the dependence between two random variables, and it is a key quantity in coding and information theory.

Definition The mutual information I(X; Y) of a DMC W : X → Y with input distribution p(x) is defined as

I(X; Y) = \sum_{x ∈ X} \sum_{y ∈ Y} p(x) W(y|x) \log \frac{W(y|x)}{\sum_{x' ∈ X} p(x') W(y|x')},   (1.5)

where p(x) is the distribution of the input X.

Remark 1. The base of the logarithm will be set to 2 by default. Other bases will be explicitly shown or stated.

While the mutual information depends on the input distribution, we are particularly interested in the mutual information under the uniform input distribution.

Definition The symmetric mutual information of a DMC W : X → Y is defined as

I(W) = \sum_{y ∈ Y} \sum_{x ∈ X} \frac{1}{q} W(y|x) \log \frac{W(y|x)}{\sum_{x' ∈ X} \frac{1}{q} W(y|x')},   (1.6)

where q is the cardinality of the input alphabet X.
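As a concrete illustration, Eq. (1.6) can be evaluated directly when a channel is specified by its transition matrix. The following is a minimal Python sketch, assuming the channel is given as a row-stochastic NumPy array with one row per input symbol; the helper name is illustrative.

```python
import numpy as np

def symmetric_mutual_information(W):
    """Symmetric mutual information I(W) of Eq. (1.6), in bits.

    W is a q x |Y| row-stochastic transition matrix (rows indexed by
    inputs, columns by outputs); the input is uniform over q symbols.
    """
    q, m = W.shape
    p_y = W.sum(axis=0) / q      # output distribution under the uniform input
    total = 0.0
    for x in range(q):
        for y in range(m):
            if W[x, y] > 0:      # zero-probability terms contribute nothing
                total += W[x, y] / q * np.log2(W[x, y] / p_y[y])
    return total

# Sanity check: a BSC with crossover 0.11 has I(W) = 1 - h(0.11), roughly 0.5 bits.
bsc = np.array([[0.89, 0.11],
                [0.11, 0.89]])
print(symmetric_mutual_information(bsc))   # approximately 0.500
```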

As we will see below, the mutual information maximized with respect to the input distribution also plays a fundamental role in channel coding.

Definition The information capacity of a DMC W : X → Y is defined as

C = \max_{p(x)} I(X; Y),   (1.7)

where the maximum is over the set of all input distributions p(x).

1.2 Channel coding

As long as the channel is not noiseless, data may be corrupted during transmission. Channel coding is an effective method to combat noise. We first give definitions related to channel codes. This section is mainly based on [14, Chapter 8].

Definition An (M, N) code for the channel W : X → Y consists of the following:
1. An index set {1, 2, ..., M}, representing messages.
2. An encoding function f : {1, 2, ..., M} → X^N. A codeword is generated by applying x = f(m), m ∈ {1, 2, ..., M}. The set of codewords is called the codebook.
3. A decoding function g : Y^N → {1, 2, ..., M}, which is a deterministic rule that assigns an estimated message to each possible received vector.

We use λ_m to denote the probability of error when the index m is transmitted, i.e.,

λ_m = P( g(Y_1^N) ≠ m | X_1^N = f(m) ).   (1.8)

In the following, we define some important terminology related to an (M, N) code.

Definition The average probability of error P_e for an (M, N) code is defined as

P_e = \frac{1}{M} \sum_{m=1}^{M} λ_m.   (1.9)

Definition The maximal probability of error λ^{(N)} for an (M, N) code is defined as

λ^{(N)} = \max_{m ∈ {1, 2, ..., M}} λ_m.   (1.10)

Definition The rate R of an (M, N) code is defined as

R = \frac{\log M}{N}  bits per transmission.   (1.11)

Definition A rate R is said to be achievable if there exists a sequence of (2^{NR}, N) codes such that the maximal probability of error λ^{(N)} tends to 0 as N → ∞.

Definition The operational capacity of a discrete memoryless channel is the supremum of all achievable rates. The operational capacity measures the highest rate at which one can communicate reliably.

Channel coding theorem

In the landmark work [70], Shannon proved that the operational capacity is equal to the information capacity.

Theorem (The channel coding theorem [70]). All rates below the information capacity C are achievable. Specifically, for every rate R < C, there exists a sequence of (2^{NR}, N) codes with maximal probability of error λ^{(N)} → 0 as N → ∞. Conversely, any sequence of (2^{NR}, N) codes with λ^{(N)} → 0 must have R ≤ C.

To prove the achievability of the information capacity, Shannon calculated the average probability of error P_e, averaged over a randomly generated codebook. He showed that P_e → 0 for any R < C as N → ∞. The vanishing P_e guarantees the existence of at least one good code that achieves the information capacity. By discarding the worst half of the codewords of the good code, the maximal probability of error λ^{(N)} under joint typicality decoding converges to 0 for any R < C as N → ∞.

Remark 2. Since the information capacity and the operational capacity are equal, in the rest of the thesis we will simply use the word capacity to refer to either definition, according to context, when there is no possibility of confusion.

1.3 A brief review of existing coding schemes

Since Shannon proved the existence of capacity-achieving codes, there has been a tremendous effort in searching for such codes. In addition to low probabilities of error, practical codes should have low encoding and decoding complexity. In this section, we briefly review existing low-complexity coding schemes; we refer the reader to [15] for a detailed history of channel coding theory and applications.

One direction of research focuses on finding linear codes with good algebraic properties, such as a large minimum Hamming distance. Hamming codes [28], Golay codes [26], Reed-Muller codes [59, 49], Reed-Solomon codes [60], and BCH codes [11, 31, 9, 43] are important examples in this category. The constructions of Reed-Muller codes bear some resemblance to those of polar codes, but they seek to maximize the minimum distance instead of the quality measures considered in polar coding.

Another direction of research, instead of finding specific codes with good worst-case performance, focuses on finding families of codes with good performance on average. Convolutional codes, concatenated codes (such as turbo codes [46]), and low-density parity-check (LDPC) codes [23] fall within this category.

Polar codes have an advantage over the above coding schemes in that they are provably capacity-achieving with low encoding and decoding complexity. Spatially coupled LDPC codes have been introduced recently and are also capacity-achieving, but the proofs are typically more involved.

1.4 Polar codes

Despite the excellent performance of turbo codes and LDPC codes in practice, none of the aforementioned codes has been proved to achieve the capacity of channels other than the binary erasure channel (BEC). Polar codes, introduced by Arıkan [3], are the first provably capacity-achieving codes with low encoding and decoding complexity for symmetric B-DMCs. The encoding and decoding complexities of polar codes are O(N log N), where N is the blocklength of the code. The analysis and construction of polar codes can be summarized as follows: (1) Given a symmetric B-DMC, virtual channels between the bits at the input of a linear encoder and the channel output sequence are created, such that the capacity of these virtual channels polarizes to either zero or one as the blocklength tends to infinity; the proportion of virtual channels with capacity close to one converges to the capacity of the original channel. This phenomenon is termed channel polarization. (2) By transmitting data bits through the noiseless

virtual channels, polar codes achieve the capacity under successive cancellation (SC) decoding. Fig. 1.2 demonstrates how the symmetric mutual information of the virtual channels polarizes as the blocklength grows. We will give a detailed review of polar codes in Chapter 2.

Fig. 1.2 Channel polarization for a BEC with ǫ = … : symmetric mutual information versus normalized virtual channel index, for blocklengths N = 2^8, N = 2^10, and N = ….

1.5 Dissertation overview

In Chapter 2, we give a literature review of polar codes designed for B-DMCs. We start from a theoretical point of view, explaining why polar codes can achieve the symmetric mutual information of B-DMCs as the blocklength tends to infinity. We then move on to practical constructions, including encoding and decoding schemes for polar codes in the finite-blocklength regime.

In Chapter 3, we study the channel polarization of arbitrary-input DMCs. We show that if the original channel has zero zero-error capacity, the virtual channels polarize towards channels with positive zero-error capacity or channels with zero capacity. The main contribution of this chapter is a simpler proof of the channel polarization theorem for arbitrary-input DMCs. The research work in this

chapter has been published in part in the paper: Jing Guo, Jossy Sayir, Minghai Qin and Albert Guillén i Fàbregas, "An alternative proof of channel polarization for channels with arbitrary input alphabets," accepted by the 53rd Annual Allerton Conference on Communication, Control, and Computing.

In Chapter 4, we study a family of polar codes whose construction rule discards the virtual channels for which the mutual information falls below a certain (fixed) threshold. We show that if the threshold, which may depend on the code length, is bounded appropriately, a coding theorem can be proved for the underlying polar code. We also give accurate closed-form upper and lower bounds on the minimum distance of the resulting code when the original channel is a binary erasure channel. The research work in this chapter has been published in part in the paper: Jing Guo, Albert Guillén i Fàbregas and Jossy Sayir, "Fixed-threshold polar codes," in Proc. IEEE International Symposium on Information Theory (ISIT) 2013, Istanbul, Turkey, July 2013.

In Chapter 5, we propose a concatenated polar coding scheme employing an inner polar code and an outer LDPC code or convolutional code on the intermediate-quality virtual channels, coupled with belief propagation (BP) decoding. We also propose an early termination method that reduces the decoding complexity. Both parallel and sequential updating schedules on polar decoding graphs are considered, and performance comparisons between concatenated polar codes and standard polar codes are provided. The research work in this chapter has been published in part in the paper: Jing Guo, Minghai Qin, Albert Guillén i Fàbregas and Paul H. Siegel, "Enhanced belief propagation decoding of polar codes through concatenation," in Proc. IEEE International Symposium on Information Theory (ISIT) 2014, Honolulu, HI, USA, July 2014.

In Chapter 6, we analyze how the messages passed through the decoding graph evolve during the BP decoding procedure. Based on this analysis, a construction of polar codes that is suitable for BP decoding is proposed. We also analyze the Hamming weight enumerating function of the codes formed by this construction. Simulation results show that this construction also provides better performance than the standard construction under successive cancellation list (SCL) decoding.

In Chapter 7, we propose an efficient sphere decoding (SD) algorithm for polar codes that achieves the maximum-likelihood (ML) performance. We improve standard SD branching conditions by computing lower bounds on the optimal decoding metric; both fixed and dynamic lower bounds are considered. We also propose an alternative decoding order based on the structure of polar codes, which further reduces

the decoding complexity in the low-to-medium SNR regime. The research work in this chapter has been published in part in the paper: Jing Guo and Albert Guillén i Fàbregas, "Efficient sphere decoding of polar codes," accepted for publication at the IEEE International Symposium on Information Theory (ISIT) 2015.

In Chapter 8, we conclude the thesis by summarizing the main contributions of each chapter and by discussing potentially interesting open problems for future research.

1.6 Notations

Throughout the thesis, we will use the following notation.
- We use upper case letters such as X, Y, U to denote random variables, and lower case letters such as x, y, u to denote their realizations.
- We use W : X → Y to denote a discrete memoryless channel (DMC) with input alphabet X and output alphabet Y, and W(y|x) to denote the transition probability of the channel W.
- We use bold symbols such as x, y, u to denote vectors; row vectors are assumed. We use 0 and 1 to denote the all-zero and all-one vectors, respectively.
- We use d_H(u, v) to denote the Hamming distance between binary vectors u and v.
- We define [b] = {1, ..., b} for b ∈ Z^+.
- We use F to denote a subset of [N], F^c to denote its complement, and |F| to denote the cardinality of F.
- We let u_F denote the sub-vector of u with indices i ∈ F, and u_a^b denote the sub-vector (u_a, ..., u_b) for 1 ≤ a ≤ b ≤ N.
- We use g_{i,j} to denote the element in the ith row and jth column of a matrix G, G^T to denote the transpose of G, and G^{-1} to denote the inverse of a non-singular matrix G.


Chapter 2
A review of polar codes

In this chapter, we give a literature review of polar codes designed for B-DMCs from both the theoretical and practical points of view. We first present the channel transformation from which virtual channels are created. Then we introduce the channel polarization phenomenon, based on which polar codes are constructed. Finally, we move on to practical constructions, encoding, and decoding schemes of polar codes in the finite-blocklength regime. This chapter lays the foundations for the rest of the thesis.

2.1 Preliminaries

In this chapter, we take the input alphabet X to be {0, 1}. We first introduce two channel parameters that are of primary interest. One is the symmetric mutual information defined in the previous chapter; we repeat the definition here.

Definition The symmetric mutual information of a B-DMC W : X → Y is defined as

I(W) = \sum_{y ∈ Y} \sum_{x ∈ X} \frac{1}{2} W(y|x) \log \frac{W(y|x)}{\frac{1}{2} W(y|0) + \frac{1}{2} W(y|1)}.   (2.1)

The other is the Bhattacharyya parameter, which measures the reliability of the channel for a single use.

Definition The Bhattacharyya parameter of a B-DMC W : X → Y is defined as

Z(W) = \sum_{y ∈ Y} \sqrt{W(y|0) W(y|1)}.   (2.2)
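The Bhattacharyya parameter is equally easy to evaluate numerically. The sketch below, an illustration in the same style as the mutual-information sketch of Chapter 1, computes Eq. (2.2) for a channel given as a 2 × |Y| transition matrix; for a BEC with erasure probability ǫ one recovers Z(W) = ǫ.

```python
import numpy as np

def bhattacharyya(W):
    """Bhattacharyya parameter Z(W) of a B-DMC, Eq. (2.2).

    W is a 2 x |Y| transition matrix; row 0 is W(.|0), row 1 is W(.|1).
    """
    return float(np.sum(np.sqrt(W[0] * W[1])))

eps = 0.3
bec = np.array([[1 - eps, eps, 0.0],     # output order: 0, erasure, 1
                [0.0,     eps, 1 - eps]])
print(bhattacharyya(bec))                # 0.3, i.e., Z(BEC(eps)) = eps
```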

A relationship between I(W) and Z(W) is stated in the following proposition.

Proposition ([3]). For any B-DMC W : X → Y, we have

I(W) ≥ \log \frac{2}{1 + Z(W)},   (2.3)

I(W) ≤ \sqrt{1 - Z(W)^2}.   (2.4)

This proposition implies that if Z(W) is close to zero then I(W) is close to one, whereas if Z(W) is close to one then I(W) is close to zero.

We now introduce a matrix operation on which the generator matrix of polar codes is built.

Definition The Kronecker product of an m × n matrix A and a k × l matrix B is defined as

A ⊗ B = [ a_{11}B  ...  a_{1n}B
            ⋮             ⋮
          a_{m1}B  ...  a_{mn}B ].   (2.5)

We use A^{⊗n} to denote the Kronecker product of the matrix A with itself n times, i.e.,

A^{⊗n} = A ⊗ A ⊗ ... ⊗ A  (n factors).   (2.6)

2.2 Channel transformation

Given a B-DMC, the virtual channels are created by a channel transformation that consists of two steps: channel combining and channel splitting [1]. N copies of the original channel W are transformed into N virtual channels W_N^{(i)}, which have properties that enable polar codes to achieve the symmetric mutual information. W_N^{(i)}, i ∈ [N], is also termed the ith bit-channel of W. We will discuss the channel transformation in detail in this subsection and the properties of the bit-channels in the next.

Channel combining

Channel combining is the step that combines copies of the original channel W into a vector channel W_N : X^N → Y^N. The vector channel W_N is the virtual channel between the input sequence u_1^N of a linear encoder and the output sequence y_1^N of N copies of the original channel W. We use W^N : X^N → Y^N to denote the vector

channel between the input sequence x_1^N and the output sequence y_1^N of N copies of the original channel W. The transition probabilities of the channels W_N, W^N and W are related by

W_N(y_1^N | u_1^N) = W^N(y_1^N | x_1^N)   (2.7)
                   = \prod_{i=1}^{N} W(y_i | x_i).   (2.8)

The linear encoder that maps u_1^N ↦ x_1^N can be represented by a square matrix G_N, which is created by applying the Kronecker product to the base matrix

G_2 = [ 1  0
        1  1 ]   (2.9)

n = log N times. Notice that G_N can be constructed in a recursive manner as follows:

G_N = G_2 ⊗ G_{N/2}   (2.10)
    = G_2 ⊗ (G_2 ⊗ G_{N/4})   (2.11)
    = G_2^{⊗n}.   (2.12)

This enables us to define the procedure of channel combining in a recursive fashion. Let ⊕ denote the bitwise XOR operation, i.e., for two vectors u_1^N and v_1^N,

u_1^N ⊕ v_1^N = (u_1 ⊕ v_1, ..., u_N ⊕ v_N).   (2.13)

In the first step, we combine two copies of the original channel W into W_2 (see Fig. 2.1). The mapping u_1^2 ↦ x_1^2 can be written as

x_1^2 = u_1^2 G_2   (2.14)
      = (u_1 ⊕ u_2, u_2).   (2.15)

In the second step, we combine two copies of W_2 into W_4 (see Fig. 2.2). The mapping u_1^4 ↦ x_1^4 can be written as

x_1^4 = u_1^4 G_4   (2.16)

Fig. 2.1 Vector channel W_2.

Fig. 2.2 Vector channel W_4.

      = u_1^4 G_2^{⊗2}   (2.17)
      = (u_1^2, u_3^4) [ G_2  0
                         G_2  G_2 ]   (2.18)
      = ((u_1^2 ⊕ u_3^4) G_2, u_3^4 G_2).   (2.19)

In the nth step, we combine two copies of W_{N/2} into W_N, where N = 2^n (see Fig. 2.3). The mapping u_1^N ↦ x_1^N can be written as

x_1^N = u_1^N G_N   (2.20)
      = u_1^N G_2^{⊗n}   (2.21)
      = ((u_1^{N/2} ⊕ u_{N/2+1}^N) G_{N/2}, u_{N/2+1}^N G_{N/2}).   (2.22)

In Arıkan's original paper [3], the input sequence u_1^N is permuted by a permutation matrix B_N, i.e.,

x_1^N = u_1^N B_N G_2^{⊗n},   (2.23)

where B_N is a permutation matrix. Since the permutation only serves as a reordering of the indices of (x_1, ..., x_N) and does not affect the properties of polar codes, we

Fig. 2.3 Vector channel W_N.

skip this permutation for simplicity of presentation throughout the thesis. We can now rewrite Eq. (2.7) as

W_N(y_1^N | u_1^N) = W^N(y_1^N | u_1^N G_N).   (2.24)

Channel splitting

Having synthesized the vector channel W_N, the next step is channel splitting. This involves splitting the vector channel W_N into N bit-channels W_N^{(i)} : X → Y^N × X^{i-1}. The transition probability of the bit-channel W_N^{(i)} is defined as

W_N^{(i)}(y_1^N, u_1^{i-1} | u_i) = \sum_{u_{i+1}^N ∈ X^{N-i}} \frac{1}{2^{N-1}} W_N(y_1^N | u_1^N).   (2.25)

The corresponding Bhattacharyya parameter Z(W_N^{(i)}) is defined as

Z(W_N^{(i)}) = \sum_{y_1^N ∈ Y^N} \sum_{u_1^{i-1} ∈ X^{i-1}} \sqrt{ W_N^{(i)}(y_1^N, u_1^{i-1} | u_i = 0) \, W_N^{(i)}(y_1^N, u_1^{i-1} | u_i = 1) }.   (2.26)

The bit-channel W_N^{(i)} is the channel that a successive cancellation decoder sees when decoding the ith bit u_i with perfect knowledge of the channel outputs y_1^N and the previous bits u_1^{i-1}.
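For very small blocklengths, Definition (2.25) can be evaluated by brute force, which is a useful didactic cross-check of the recursive formulas given next. The following sketch is an illustration in our notation: it builds G_N = G_2^{⊗n} without the bit-reversal permutation, consistent with the convention above, and enumerates all input and output sequences, so it is exponential in N.

```python
import itertools
import numpy as np

def bit_channel(W, n, i):
    """Brute-force transition probabilities of the bit-channel W_N^(i),
    evaluated directly from Eq. (2.25); N = 2^n and i is one-based.

    W is the 2 x |Y| matrix of the underlying B-DMC. Returns an array
    P[u_i, y_1, ..., y_N, u_1, ..., u_{i-1}]. Exponential in N: intended
    only as a sanity check for tiny N, not as a construction tool.
    """
    N = 1 << n
    m = W.shape[1]
    G = np.array([[1]])
    for _ in range(n):                       # G_N = G_2 kron G_{N/2}, Eq. (2.10)
        G = np.kron(np.array([[1, 0], [1, 1]]), G)
    P = np.zeros((2,) + (m,) * N + (2,) * (i - 1))
    for u in itertools.product((0, 1), repeat=N):
        x = np.array(u) @ G % 2              # x = u G_N
        for y in itertools.product(range(m), repeat=N):
            p = np.prod([W[x[k], y[k]] for k in range(N)])
            P[(u[i - 1],) + y + u[:i - 1]] += p / 2 ** (N - 1)
    return P
```

For N = 2 and i = 1, 2 this reproduces the pair of channels given in Eqs. (2.27) and (2.28) below.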

Recursive channel transformation

So far, we have described the procedures of channel combining and splitting, the steps necessary to obtain the bit-channels W_N^{(i)}. Due to the special structure of G_N, the channel transformation can be carried out in a recursive fashion. Let u_1^N be the input sequence and y_1^N the output sequence. A single-step channel transformation generates a pair of binary-input channels (W_2^{(1)}, W_2^{(2)}) from two independent copies of the channel W. In the first step, given an original channel W, the pair of bit-channels (W_2^{(1)}, W_2^{(2)}) is described by the transition probabilities

W_2^{(1)}(y_1^2 | u_1) = \sum_{u_2} \frac{1}{2} W(y_1 | u_1 ⊕ u_2) W(y_2 | u_2),   (2.27)

W_2^{(2)}(y_1^2, u_1 | u_2) = \frac{1}{2} W(y_1 | u_1 ⊕ u_2) W(y_2 | u_2).   (2.28)

We use W ⊟ W to denote the transformation of two copies of the channel W defined by Eq. (2.27), and W ⊞ W to denote the transformation of two copies of the channel W defined by Eq. (2.28).

At the ith step, we have synthesized the bit-channels W_{2^{i-1}}^{(j)} : X → Y^{2^{i-1}} × X^{j-1}, j = 1, ..., 2^{i-1}. By applying a single-step channel transformation to two copies of the bit-channel W_{2^{i-1}}^{(j)}, we obtain a pair of bit-channels (W_{2^i}^{(2j-1)}, W_{2^i}^{(2j)}), where

W_{2^i}^{(2j-1)} = W_{2^{i-1}}^{(j)} ⊟ W_{2^{i-1}}^{(j)},   W_{2^i}^{(2j)} = W_{2^{i-1}}^{(j)} ⊞ W_{2^{i-1}}^{(j)}.   (2.29)

Eq. (2.29) can be expanded as

W_{2^i}^{(2j-1)}(y_1^{2^i}, u_1^{2j-2} | u_{2j-1}) = \sum_{u_{2j}} \frac{1}{2} W_{2^{i-1}}^{(j)}(y_1^{2^{i-1}}, u_{1,o}^{2j-2} ⊕ u_{1,e}^{2j-2} | u_{2j-1} ⊕ u_{2j}) \, W_{2^{i-1}}^{(j)}(y_{2^{i-1}+1}^{2^i}, u_{1,e}^{2j-2} | u_{2j}),   (2.30)

W_{2^i}^{(2j)}(y_1^{2^i}, u_1^{2j-1} | u_{2j}) = \frac{1}{2} W_{2^{i-1}}^{(j)}(y_1^{2^{i-1}}, u_{1,o}^{2j-2} ⊕ u_{1,e}^{2j-2} | u_{2j-1} ⊕ u_{2j}) \, W_{2^{i-1}}^{(j)}(y_{2^{i-1}+1}^{2^i}, u_{1,e}^{2j-2} | u_{2j}).   (2.31)

Here u_{1,o}^{2j-2} = (u_1, u_3, ..., u_{2j-3}) and u_{1,e}^{2j-2} = (u_2, u_4, ..., u_{2j-2}). Note that Eq. (2.30) and Eq. (2.31) are identical to Eq. (2.27) and Eq. (2.28) if we apply the following substitutions:

u_1 → u_{2j-1},   y_1 → (y_1^{2^{i-1}}, u_{1,o}^{2j-2} ⊕ u_{1,e}^{2j-2}),

u_2 → u_{2j},   y_2 → (y_{2^{i-1}+1}^{2^i}, u_{1,e}^{2j-2}).

Thus we can first transform N copies of the original channel W into N/2 copies each of the bit-channels W_2^{(1)} and W_2^{(2)}, then into N/4 copies each of W_4^{(1)}, W_4^{(2)}, W_4^{(3)}, W_4^{(4)}, and so on, until we obtain the N bit-channels W_N^{(i)}, i = 1, ..., N.

2.3 Channel polarization

In this section, we introduce the concept of channel polarization, which is fundamental in proving that polar codes can achieve the symmetric mutual information of B-DMCs. We first show that the channel transformation preserves the overall symmetric mutual information. By the chain rule for mutual information, we have

I(Y_1^N ; X_1^N, U_1^N) = I(Y_1^N ; X_1^N) + I(Y_1^N ; U_1^N | X_1^N)   (2.32)
                        = I(Y_1^N ; U_1^N) + I(Y_1^N ; X_1^N | U_1^N).   (2.33)

Since

I(Y_1^N ; U_1^N | X_1^N) = I(Y_1^N ; X_1^N | U_1^N) = 0   (2.34)

(the first term vanishes because Y_1^N depends on U_1^N only through X_1^N, and the second because X_1^N is a deterministic function of U_1^N), we have

I(Y_1^N ; X_1^N) = I(Y_1^N ; U_1^N).   (2.35)

Thus,

I(W_N) = I(Y_1^N ; U_1^N)   (2.36)
       = I(Y_1^N ; X_1^N)   (2.37)
       = N I(W).   (2.38)

This shows that the channel transformation preserves the sum of the symmetric mutual informations of the N copies of the original channel. However, the symmetric mutual information of an individual bit-channel W_N^{(i)}, i ∈ [N], is in general different from I(W). The following theorem shows that as n → ∞, the symmetric mutual information of each individual channel converges almost surely to either 0 or 1.

Theorem 2.3.1 ([3]). For any B-DMC W, the bit-channels W_N^{(i)} polarize in the sense that, for any fixed δ ∈ (0, 1), as N tends to infinity through powers of two, the fraction of indices i ∈ {1, ..., N} for which I(W_N^{(i)}) ∈ (1 - δ, 1] tends toward I(W) and the

fraction for which I(W_N^{(i)}) ∈ [0, δ) tends toward 1 - I(W). That is,

lim_{N→∞} |{ i ∈ [N] : I(W_N^{(i)}) ∈ (1 - δ, 1] }| / N = I(W),   (2.39)

lim_{N→∞} |{ i ∈ [N] : I(W_N^{(i)}) ∈ [0, δ) }| / N = 1 - I(W).   (2.40)

We call the phenomenon shown in Eq. (2.39) and Eq. (2.40) channel polarization. Since the capacity-achieving property of polar codes is mainly based on Theorem 2.3.1, we give a sketch of its proof from [3] in the remainder of this subsection.

Let n = log N and label the bit-channels W_N^{(i)} as

W_{2^n}^{(i)} = W_{b_1 b_2 ... b_n},   i = 1 + \sum_{j=1}^{n} b_j 2^{n-j}.   (2.41)

Define {B_n ; n ≥ 1} as a sequence of independent and identically distributed (i.i.d.) Bernoulli random variables equiprobable on the set {0, 1}, and define a random tree process {W_n ; n ≥ 0} as follows:

W_0 = W,   (2.42)
W_n = W_{B_1 B_2 ... B_n}.   (2.43)

Given b_1 b_2 ... b_n as a sample value of the random variables B_1 B_2 ... B_n, the random process W_n takes the value W_{b_1 b_2 ... b_n}. Let a probability space (Ω, A, P) be defined, where Ω is the space of all binary sequences (b_1, b_2, ...) ∈ {0, 1}^∞, A_0 = {∅, Ω}, and A_n = σ(B_1, ..., B_n) is the Borel field generated by the cylinder sets defined by (b_1, ..., b_n); P is the probability measure defined on A. We now define the random processes for the channel parameters as follows:

{I_n ; n ≥ 0} := {I(W_n) ; n ≥ 0},   (2.44)
{Z_n ; n ≥ 0} := {Z(W_n) ; n ≥ 0}.   (2.45)
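For the BEC, the one-step transformation acts on the erasure probability in closed form (z ↦ 2z - z² and z ↦ z²; these are Eqs. (2.59) and (2.60) in Section 2.7), so sample paths of {Z_n} can be simulated exactly. The following sketch, an illustration under that BEC assumption, shows the concentration of Z_n on {0, 1} discussed next; for the BEC, I_n = 1 - Z_n.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_Zn(eps, n, paths=100_000):
    """Sample paths of Z_n for W = BEC(eps), following Eqs. (2.42)-(2.45).

    Each path applies n random one-step transforms, selected by the
    i.i.d. Bernoulli(1/2) sequence B_1, ..., B_n.
    """
    z = np.full(paths, eps)
    for _ in range(n):
        b = rng.integers(0, 2, size=paths)
        z = np.where(b == 0, 2 * z - z ** 2, z ** 2)   # minus / plus transform
    return z

z = simulate_Zn(eps=0.5, n=20)
# Z_n concentrates on {0, 1}; the fraction of near-noiseless paths
# (Z_n close to 0) approaches I(W) = 1 - eps = 0.5:
print((z < 1e-6).mean(), (z > 1 - 1e-6).mean())
```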

We proceed by presenting some important properties of these random processes.

Proposition 2.3.2 ([3]). {(I_n, A_n) ; n ≥ 0} is a bounded martingale, i.e.,

A_n ⊆ A_{n+1} and I_n is A_n-measurable,   (2.46)
E[|I_n|] < ∞,   (2.47)
I_n = E[I_{n+1} | A_n].   (2.48)

Building on Proposition 2.3.2, it can be shown that the sequence {I_n ; n ≥ 0} converges almost everywhere to a random variable I_∞ such that E[I_∞] = I_0.

Proposition 2.3.3 ([3]). {(Z_n, A_n) ; n ≥ 0} is a supermartingale, i.e.,

A_n ⊆ A_{n+1} and Z_n is A_n-measurable,   (2.49)
E[|Z_n|] < ∞,   (2.50)
Z_n ≥ E[Z_{n+1} | A_n].   (2.51)

Building on Proposition 2.3.3, it can be shown that the sequence {Z_n ; n ≥ 0} converges almost everywhere to a random variable Z_∞ that takes values in {0, 1}.

Proposition 2.3.4 ([3]). The limit I_∞ takes values almost everywhere in the set {0, 1}: P(I_∞ = 1) = I_0 and P(I_∞ = 0) = 1 - I_0.

Theorem 2.3.1 is a corollary of Proposition 2.3.4. Recall that I_0 = I(W); by P(I_∞ = 1) = I_0, the fraction of good channels converges to I(W), and by P(I_∞ = 0) = 1 - I_0, the fraction of completely noisy channels converges to 1 - I(W).

2.4 Polar code encoding

Given a B-DMC W, we will use F ⊂ [N] to denote the set of indices of bit-channels whose symmetric mutual information is close to 0. These bit-channels are not capable of transmitting data bits reliably, and we will freeze their corresponding inputs u_F to predetermined values known at the decoder. We call this subset F the frozen set. The complementary sub-vector u_{F^c} can be used to transmit information, and we will interchangeably call u_{F^c} information bits or data bits. The frozen set F can be composed of the indices of the N - K bit-channels with the largest Bhattacharyya

parameter. That is,

Z(W_N^{(i)}) ≥ Z(W_N^{(j)}),   ∀ i ∈ F, j ∈ F^c.   (2.52)

Once the frozen set F is decided, the encoding procedure of polar codes is quite straightforward. We set u_F to the predetermined values and u_{F^c} to the information bits to be transmitted. Then u is mapped to a codeword x through the linear encoder x = u G_N. For the sake of simplicity, we let u_i = 0 for i ∈ F. Due to the recursive structure of G_N, the encoding complexity of polar codes can be reduced from O(N²), the complexity of a generic vector-matrix multiplication, to O(N log N): the encoding is done layer by layer over n = log N layers, and within each layer the computational complexity is O(N). Fig. 2.4 illustrates the encoding of an (N, K) = (8, 4) polar code designed for a BEC with erasure probability ǫ = 1/2.

Note that for BECs, closed-form formulas are available to compute the Bhattacharyya parameters of the bit-channels efficiently [3]. For other B-DMCs, no explicit formulas are available; we will discuss a way of constructing polar codes by approximating the quality of the bit-channels in Section 2.7.

So far, the described encoder of polar codes is not systematic. Systematic polar coding was studied in [4], where it was observed that the bit error rate (BER) is smaller than for non-systematic polar codes. We will focus on non-systematic polar codes throughout the thesis, since the major body of work on polar codes is focused on the non-systematic case.

Fig. 2.4 Encoding procedure for an (8, 4) polar code, BEC with ǫ = 1/2.
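The recursion in Eq. (2.22) translates directly into an O(N log N) encoder. The following sketch implements both the Kronecker construction of G_N and the recursive encoder (natural order, without bit-reversal); the (8, 4) example uses the frozen set {1, 2, 3, 5} (one-based) that Eqs. (2.59) and (2.60) yield for the BEC with ǫ = 1/2, matching the setting of Fig. 2.4.

```python
import numpy as np

def generator_matrix(n):
    """G_N = G_2 kron ... kron G_2 (n times), Eq. (2.12); no bit-reversal."""
    G2 = np.array([[1, 0], [1, 1]])
    G = np.array([[1]])
    for _ in range(n):
        G = np.kron(G2, G)
    return G

def polar_encode(u):
    """x = u G_N over GF(2) via the recursion of Eq. (2.22); O(N log N)."""
    N = len(u)
    if N == 1:
        return u.copy()
    half = N // 2
    left = polar_encode((u[:half] + u[half:]) % 2)   # (u_1^{N/2} xor u_{N/2+1}^N) G_{N/2}
    right = polar_encode(u[half:])                   # u_{N/2+1}^N G_{N/2}
    return np.concatenate([left, right])

# (8, 4) polar code for a BEC with eps = 1/2: data bits on positions 4, 6, 7, 8.
u = np.zeros(8, dtype=int)
u[[3, 5, 6, 7]] = [1, 0, 1, 1]       # 0-based indices of the information positions
x = polar_encode(u)
assert np.array_equal(x, u @ generator_matrix(3) % 2)   # agrees with x = u G_N
```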

2.5 SC decoding of polar codes

It has been proved in [3] that polar codes with SC decoding can achieve the capacity of B-DMCs. Since then, many other decoding algorithms have been proposed [8, 13, 42, 77, 35, 78, 36, 25] to improve the error-rate performance in the finite-length regime. We review the SC decoding algorithm in this section, and review other decoding algorithms for polar codes, such as belief propagation (BP) decoding, successive cancellation list (SCL) decoding, and sphere decoding (SD), in the later chapters.

Let u_1^N be the input sequence to the polar encoder, x_1^N the corresponding codeword, and y_1^N the channel observations. SC decoding is done in a sequential manner: the estimate of bit u_i is based on the received vector y and the estimates û_1^{i-1} of the previous bits u_1^{i-1}. Letting

L_N^{(i)}(y, û_1^{i-1}) = \frac{W_N^{(i)}(y, û_1^{i-1} | u_i = 0)}{W_N^{(i)}(y, û_1^{i-1} | u_i = 1)},   (2.53)

u_i is then estimated as

û_i = 0, if i ∈ F^c and L_N^{(i)}(y, û_1^{i-1}) ≥ 1;
û_i = 1, if i ∈ F^c and L_N^{(i)}(y, û_1^{i-1}) < 1;
û_i = the frozen value, otherwise.
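The decision rule above has a compact recursive implementation that mirrors the structure of the encoder. The following LLR-domain sketch is a minimal illustration: it reuses polar_encode from the encoding sketch above, assumes frozen bits are set to 0, and uses the min-sum approximation of the exact check-node rule, a common simplification.

```python
import numpy as np

def sc_decode(llr, frozen):
    """Successive cancellation decoding in the LLR domain (minimal sketch).

    llr    : channel LLRs log W(y_i|0)/W(y_i|1), length N = 2^n
    frozen : boolean mask of length N, True where u_i is frozen (to 0)
    Returns the estimate of u in natural order, matching the encoder above.
    """
    N = len(llr)
    if N == 1:
        return np.array([0 if (frozen[0] or llr[0] >= 0) else 1])
    half = N // 2
    l1, l2 = llr[:half], llr[half:]
    # f-step: LLRs of the XOR-combined first half (min-sum approximation of
    # the exact rule 2 atanh(tanh(l1/2) tanh(l2/2)))
    f = np.sign(l1) * np.sign(l2) * np.minimum(np.abs(l1), np.abs(l2))
    u_left = sc_decode(f, frozen[:half])
    # g-step: cancel the re-encoded first half, then combine both observations
    c = polar_encode(u_left)
    g = l2 + (1 - 2 * c) * l1
    u_right = sc_decode(g, frozen[half:])
    return np.concatenate([u_left, u_right])
```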

2.6 Coding theorems

Let P_e(N, R, u_F) be the block error probability of the polar code with blocklength N, rate R = K/N, and frozen bits u_F under SC decoding. Let P_e(N, R) be the average error probability of the polar code over all choices of u_F, i.e.,

P_e(N, R) = E[P_e(N, R, U_F)]   (2.54)
          = \sum_{u_F ∈ {0,1}^{N-K}} P(U_F = u_F) P_e(N, R, u_F).   (2.55)

Arıkan proved the following coding theorem in [3].

Theorem For any given B-DMC W and any fixed rate R < I(W), the average block error probability of a polar code under SC decoding satisfies

P_e(N, R) = O(N^{-1/4}).   (2.56)

This theorem is a corollary of the following one.

Theorem For any B-DMC with I(W) > 0 and any fixed R < I(W), there exists a sequence of sets F_N^c ⊆ [N], N ∈ {1, ..., 2^n, ...}, such that |F_N^c| ≥ NR and

Z(W_N^{(i)}) ≤ O(N^{-5/4}),   for all i ∈ F_N^c.   (2.57)

The following stronger version is proved in [3] as well.

Theorem For any given B-DMC W and any fixed rate R < I(W), the block error probability of a polar code with blocklength N, rate R = K/N, and frozen bits u_F under SC decoding satisfies

P_e(N, R, u_F) = O(N^{-1/4}).   (2.58)

2.7 Practical code constructions

Although the construction of a polar code is explicitly defined in theory, calculating the quality (mutual information or Bhattacharyya parameter) of the bit-channels is a challenge in practice: the output alphabet size of the bit-channels grows exponentially with the blocklength, which makes the exact calculation intractable. The only exception is the BEC, for which closed-form formulas are given in [3] to compute the Bhattacharyya parameters of the bit-channels recursively:

Z(W ⊟ W) = 2 Z(W) - Z(W)²,   (2.59)
Z(W ⊞ W) = Z(W)².   (2.60)

Several methods have been proposed to estimate the qualities of the bit-channels [47, 75, 76, 80, 81, 37, 48, 56]. The approximation accuracy of the method proposed in [75] is theoretically guaranteed and is by far the best; we therefore use the method of [75] to construct the polar codes used in this thesis, unless specified otherwise.
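For the BEC, Eqs. (2.59) and (2.60) make the construction exact and fast. The sketch below computes Z(W_N^{(i)}) for all bit-channels (entry i - 1 corresponding to bit-channel i, following the labeling of Eq. (2.41)) and selects the frozen set of Eq. (2.52).

```python
import numpy as np

def bec_bit_channel_Z(eps, n):
    """Exact Z(W_N^(i)) for all N = 2^n bit-channels of a BEC(eps),
    by recursive application of Eqs. (2.59)-(2.60)."""
    z = np.array([eps])
    for _ in range(n):
        nxt = np.empty(2 * z.size)
        nxt[0::2] = 2 * z - z ** 2   # minus transform -> bit-channel 2j - 1
        nxt[1::2] = z ** 2           # plus transform  -> bit-channel 2j
        z = nxt
    return z

def frozen_set(eps, n, K):
    """0-based indices of the N - K least reliable bit-channels, Eq. (2.52)."""
    z = bec_bit_channel_Z(eps, n)
    return np.sort(np.argsort(z)[::-1][: z.size - K])

print(frozen_set(0.5, 3, 4))   # [0 1 2 4], i.e., frozen set {1, 2, 3, 5} one-based
```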

In this section, we outline the methods proposed in [75]. The key idea in [75] is to approximate a bit-channel having an intractable output alphabet size by channels having a manageable output alphabet size. Two approximation methods are considered, termed channel degradation and channel upgradation, which yield channels with worse and better qualities than the original bit-channel, respectively. The quality of the original bit-channel is thus bounded between the two, and the gap between the upper and lower bounds is shown to be small [75]. Hence the qualities of the approximated bit-channels are close to those of the original bit-channels.

Definition [75] A channel Q : X → Z is degraded with respect to a channel W : X → Y, denoted by Q ⪯ W, if there exists a channel P : Y → Z such that for all z ∈ Z and x ∈ X,

Q(z | x) = \sum_{y ∈ Y} W(y | x) P(z | y).   (2.61)

Definition [75] A channel Q' : X → Z' is upgraded with respect to a channel W : X → Y, denoted by Q' ⪰ W, if there exists a channel P : Z' → Y such that for all y ∈ Y and x ∈ X,

W(y | x) = \sum_{z' ∈ Z'} Q'(z' | x) P(y | z').   (2.62)

The following lemma states that the degraded channel Q is worse than the original channel W, as measured by the mutual information and the Bhattacharyya parameter.

Lemma [61] Given two B-DMCs W and Q, if Q ⪯ W, we have

Z(Q) ≥ Z(W),   (2.63)
I(Q) ≤ I(W).   (2.64)

Similarly, the following lemma states that the upgraded channel Q' is better than the original channel W, as measured by the mutual information and the Bhattacharyya parameter.

Lemma [61] Given two B-DMCs W and Q', if Q' ⪰ W, we have

Z(Q') ≤ Z(W),   (2.65)
I(Q') ≥ I(W).   (2.66)

The following lemma states that the degradation and upgradation relations are preserved by the channel transformation defined in Section 2.2.

Lemma [75] Applying the channel transformations defined in Eq. (2.27) and Eq. (2.28) to two B-DMCs W and Q, if Q ⪯ W, we have

Q ⊟ Q ⪯ W ⊟ W,   (2.67)
Q ⊞ Q ⪯ W ⊞ W.   (2.68)

On the other hand, if Q ⪰ W, we have

Q ⊟ Q ⪰ W ⊟ W,   (2.69)
Q ⊞ Q ⪰ W ⊞ W.   (2.70)

Given a B-DMC W, we now describe how to approximate its bit-channel W_N^{(i)}, i ∈ [N], by a degraded channel Q whose output alphabet size is at most L. Let (b_1(i), ..., b_n(i)) denote the binary representation of i - 1, so that i = 1 + \sum_{j=1}^{n} b_j(i) 2^{n-j}. In the algorithm, the function degrading(W, L) returns a channel that is degraded with respect to W and has an output alphabet size of at most L.

Algorithm 2.7.6 (Channel degradation)
Input: W, the original channel; (b_1(i), ..., b_n(i)), the binary representation of the index of the bit-channel to be approximated; L, a bound on the output alphabet size of the degraded channel.
Output: Q_N^{(i)}, a channel that is degraded with respect to the bit-channel W_N^{(i)}.

  Q ← degrading(W, L)
  for j = 1, ..., n do
      if b_j(i) = 0 then
          W' ← Q ⊟ Q
      else
          W' ← Q ⊞ Q
      end if
      Q ← degrading(W', L)
  end for
  return Q_N^{(i)} = Q

Consider a B-DMC W with output alphabet size |Y|. We give a simple example of the function degrading(W, L). Fig. 2.5 illustrates an operation that results in a degraded channel Q with output alphabet size |Y| - 1: two output symbols are merged into one, where the entries in the first and second rows of a channel in Fig. 2.5 are the probabilities of receiving the corresponding symbol given that 0 or 1 was transmitted, respectively. One can obtain a degraded channel with output alphabet size L by repeating this operation |Y| - L times.

Fig. 2.5 A degrading operation.

The upgrading procedure is essentially the same as Algorithm 2.7.6, except that the degrading function is replaced by an upgrading function. For details on how to find appropriate upgrading and degrading functions, we refer the reader to [75].

The frozen set F for a polar code with rate R is constructed by choosing the N(1 - R) indices such that Z(Q_N^{(i)}) ≥ Z(Q_N^{(j)}) for all i ∈ F, j ∈ F^c.
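The merging operation of Fig. 2.5 is easy to prototype. The sketch below is a naive illustration of a degrading(W, L) function that greedily merges the pair of adjacent output columns losing the least symmetric mutual information; it reuses symmetric_mutual_information from the sketch in Chapter 1 and is not the optimized merging procedure of [75].

```python
import numpy as np

def degrade_merge_once(W):
    """Merge one pair of adjacent output columns of W, choosing the pair
    whose merger loses the least symmetric mutual information. Summing
    two columns realizes an intermediate channel P as in Eq. (2.61), so
    the result is degraded with respect to W."""
    best_loss, best_Q = None, None
    for y in range(W.shape[1] - 1):
        merged = (W[:, y] + W[:, y + 1])[:, None]
        Q = np.concatenate([W[:, :y], merged, W[:, y + 2:]], axis=1)
        loss = symmetric_mutual_information(W) - symmetric_mutual_information(Q)
        if best_loss is None or loss < best_loss:
            best_loss, best_Q = loss, Q
    return best_Q

def degrading(W, L):
    """Naive degrading(W, L): merge outputs until at most L columns remain."""
    while W.shape[1] > L:
        W = degrade_merge_once(W)
    return W
```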

2.8 Summary

In this chapter, we gave a brief review of standard polar codes for B-DMCs. In particular, we provided a sketch of the proof that polar codes achieve the symmetric mutual information when the blocklength tends to infinity. The capacity-achieving property and its proof lay the theoretical foundations for Chapter 3 and Chapter 4. Furthermore, we reviewed a practical method to efficiently estimate the quality of the virtual channels in the finite-blocklength regime, which is used to generate the standard polar codes used in Chapter 5, Chapter 6, and Chapter 7. We also reviewed the SC decoding algorithm for polar codes, which is used later as a benchmark to show the improvement of our proposed decoding algorithms.

Chapter 3
Channel polarization for arbitrary-input DMCs

3.1 Introduction

In the celebrated work of Arıkan [3], the channel polarization theorem is proved only for B-DMCs. We briefly reviewed the proof in Chapter 2 and showed that polar codes, constructed based on this theorem, can achieve the channel capacity under successive cancellation (SC) decoding. The channel polarization theorem has been generalized to prime-input DMCs in [66], to prime-power-input DMCs in [54, 55], and to arbitrary-input DMCs in [67, 63, 64]. References [54, 55, 67, 63, 64] all follow Arıkan's proof technique summarized in Section 2.3, which is based on the martingale properties of the random processes Z_n and I_n. In [66], the channel polarization theorem for prime-input DMCs is proved without using the martingale property of Z_n. Instead, the proof is based on an entropy inequality for the virtual channels (mentioned in [3] for the case where the original channel is a B-DMC), i.e., the fact that the mutual information of the virtual channels is strictly different from that of the original channel, together with the martingale property of the random process {I_n ; n ≥ 0}. As an extension of [66], the channel polarization theorem is proved in [52] for arbitrary DMCs whose input alphabet forms a quasigroup.

In this chapter, we revisit the channel polarization problem for arbitrary DMCs and provide an alternative proof of the channel polarization theorem for arbitrary-input DMCs. Similarly to [52], our approach does not involve the Bhattacharyya parameter. There are two main differences between our proof technique and the one proposed in [52]. First, while [52] proves the entropy inequality for the virtual channels by lower-bounding the mutual information difference between the original channel and the worse virtual channel, we consider the difference between the better virtual channel and the

original channel, for which a simple expression is given and bounded away from zero when the input alphabet and the operation used in the channel transformation form a monoid. Though these two ideas might seem similar, ours leads to a new approach for proving the strict inequality. Second, our approach makes use of the properties of Markov chains and of the zero-error capacity, without involving distances between probability distributions. Moreover, we show that the extremal channels to which the virtual channels converge have a zero-error capacity equal to their capacity. We note that our proof of the channel polarization theorem is restricted to group operations for now, while the stronger results in [52] apply to the wider class of quasigroups.

We first review the entropy-based proof proposed in [66] and introduce the definition of the zero-error capacity. We then prove an entropy inequality for arbitrary-input DMCs with zero zero-error capacity under one step of the channel transformation, and prove that the zero-error capacity of the virtual channels is zero as well. Then, we show that the virtual channels converge asymptotically to a set of channels with positive zero-error capacity or channels with zero capacity. We conclude this chapter with some discussion of channel polarization for channels whose input alphabet forms a monoid (not necessarily a group).

Entropy inequality for prime-input DMCs

In this section, we briefly review the proofs in [66]. Consider a DMC W : X → Y. Let the input alphabet be X = {0, ..., q-1} and assume that for all y ∈ Y there exists x ∈ X such that W(y|x) > 0. Let U_1 and U_2 be independent random variables taking values in X, and let

X_1 = U_1 ⊕ U_2,   (3.1)
X_2 = U_2,   (3.2)

where ⊕ denotes modulo-q addition. Let W⁻ : X → Y² be the virtual channel between U_1 and Y_1 Y_2, and W⁺ : X → Y² × X be the virtual channel between U_2 and Y_1 Y_2 U_1. W⁻ and W⁺ are synthesized by one step of the channel transformation (see Fig. 3.1). After n recursive steps of channel transformation, we can synthesize 2^n virtual channels. We follow the notation defined in Chapter 2: let W_n be a random variable that chooses equiprobably from all possible 2^n virtual channels after the nth step, and let I_n = I(W_n) be the mutual information of W_n. The random process {I_n ; n ≥ 0} is proved in [3] to be a bounded martingale for B-DMCs.
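The one-step transformation of Fig. 3.1 can also be carried out numerically for a q-ary-input channel. The sketch below is an illustration assuming modulo-q addition and uniform U_1, U_2; it builds the transition matrices of W⁻ and W⁺ with the output tuples flattened into a single axis, and, combined with the symmetric_mutual_information sketch of Chapter 1, it lets one check the conservation law I(W⁻) + I(W⁺) = 2 I(W).

```python
import numpy as np

def polar_transform_pair(W):
    """Transition matrices of (W-, W+) for a q-ary-input DMC under
    modulo-q addition, with U_1, U_2 i.i.d. uniform (Fig. 3.1).

    W is q x m. W_minus[u1] ranges over pairs (y1, y2); W_plus[u2]
    ranges over triples (y1, y2, u1), each flattened into one axis.
    """
    q, m = W.shape
    W_minus = np.zeros((q, m * m))
    W_plus = np.zeros((q, m * m * q))
    for u1 in range(q):
        for u2 in range(q):
            joint = np.outer(W[(u1 + u2) % q], W[u2]) / q   # (1/q) W(y1|u1+u2) W(y2|u2)
            W_minus[u1] += joint.ravel()                    # summed over u2, as in Eq. (2.27)
            W_plus[u2, np.arange(m * m) * q + u1] = joint.ravel()
    return W_minus, W_plus

# Ternary symmetric channel: check I(W-) + I(W+) = 2 I(W).
W = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
Wm, Wp = polar_transform_pair(W)
I = symmetric_mutual_information
print(I(Wm) + I(Wp), 2 * I(W))   # equal up to floating-point rounding
```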

Fig. 3.1 One step channel transformation.

Based on the assumption that the input alphabet size |X| = q is a prime number, [66] proves that the symmetric mutual information of the virtual channel W⁻ is strictly less than that of the original channel W. The base of the logarithm is set to q in this chapter.

Lemma 3.1.1 ([66]). If I(W) ∈ (δ, 1 - δ) for some δ > 0, then there exists an ε(δ) > 0 such that

I(W⁻) + ε(δ) ≤ I(W) ≤ I(W⁺) - ε(δ).   (3.3)

Lemma 3.1.1 is a corollary of the following lemma.

Lemma 3.1.2 ([66]). Let X_1, X_2 ∈ X and Y_1, Y_2 ∈ Y be random variables with joint probability distribution

P_{X_1 Y_1 X_2 Y_2}(x_1, y_1, x_2, y_2) = P_{X_1 Y_1}(x_1, y_1) P_{X_2 Y_2}(x_2, y_2).   (3.4)

If H(X_1 | Y_1), H(X_2 | Y_2) ∈ (δ, 1 - δ) for some δ > 0, then there exists an ε(δ) > 0 such that

H(X_1 + X_2 | Y_1, Y_2) ≥ max{H(X_1 | Y_1), H(X_2 | Y_2)} + ε(δ).   (3.5)

It is easy to see that Lemma 3.1.1 follows from Lemma 3.1.2; we briefly explain the logic here. Consider a step of the channel transformation described in Fig. 3.1. Since U_1 and U_2 are independent and equiprobable on the supporting set X and ⊕ is modulo-q addition, X_1 and X_2 are independent and equiprobable on X. Thus X_1, X_2, Y_1, Y_2 are jointly distributed as in Eq. (3.4). Assume I(W) ∈ (δ, 1 - δ); since the base of the logarithm is q, we have H(X_1 | Y_1) = 1 - I(W) ∈ (δ, 1 - δ). According to Lemma 3.1.2, we have I(W⁻) = 1 - H(X_1 + X_2 | Y_1 Y_2) < 1 - H(X_1 | Y_1) = I(W). Since I(W⁻) + I(W⁺) = 2 I(W), Lemma 3.1.1 is proved.

Based on the martingale property of the random process I_n (as defined in Eq. (2.44)) and Lemma 3.1.1, one can prove the channel polarization theorem for q-ary-input DMCs. However, the proof of Lemma 3.1.2 (see [66] for the details) is critically based on the assumption that the input alphabet size is a prime number. We will generalize Lemma 3.1.1 to arbitrary-input DMCs with a different proof technique.

3.2 Preliminaries

Before we present the main result of this chapter, we introduce the following technical lemma, which will be used to prove Lemma 3.3.2 and Theorem 3.4.2.

Lemma 3.2.1. For random variables $X, Y, Z$ whose probability distributions are supported on their respective alphabets $\mathcal{X}, \mathcal{Y}, \mathcal{Z}$, if $X - Y - Z$ and $Y - X - Z$ both form Markov chains, then for all $x, y, z$ such that $P_{XY}(x, y) > 0$,

$$P_{Z|Y}(z|y) = P_{Z|X}(z|x). \qquad (3.6)$$

Consequently, for any $y \in \mathcal{Y}$, $P_{Z|X}(z|x)$ takes the same value for all $x$ such that $P_{XY}(x, y) > 0$.

Proof. See Section 3.5.1.

Now we introduce terminology that will be used throughout this chapter.

Zero-error capacity

In [71], Shannon introduced the concept of zero-error capacity.

Definition 3.2.2. The zero-error capacity of a noisy channel is the supremum of all rates at which information can be transmitted with zero error probability.

Since the capacity of a channel is the supremum of all rates at which information can be transmitted with vanishing error probability, a channel's zero-error capacity is always upper bounded by its capacity. We use $C_0(W)$ to denote the zero-error capacity of a channel $W$. Channels whose zero-error capacity equals zero are of primary interest. BECs with strictly positive erasure probability, BSCs with strictly positive crossover probability, and AWGN channels with strictly positive noise power are examples of such channels.

Definition 3.2.3. Let $\mathcal{C}_0$ be the set of channels whose zero-error capacity is positive. Let $\mathcal{C}$ be the set of channels whose capacity is zero. Let $\mathcal{C}_0^* = \mathcal{C}_0 \cup \mathcal{C}$.

Fig. 3.2 illustrates some channels that belong to $\mathcal{C}_0^*$: Figs. 3.2a and 3.2c show channels with positive zero-error capacity, and Fig. 3.2b shows a channel with zero capacity.
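As a sanity check on these definitions, the following sketch (our own illustration; `in_C` and `in_C0` are names of our choosing) classifies three toy channels of the kind shown in Fig. 3.2. It uses two elementary facts: a channel has zero capacity exactly when all rows of its transition matrix are identical, and it has positive zero-error capacity as soon as two inputs have disjoint output supports, which by Lemma 3.2.4 below is also necessary.

```python
import numpy as np

def in_C(W, tol=1e-12):
    """Zero capacity: the output distribution does not depend on the input."""
    return bool(np.all(np.abs(W - W[0]) < tol))

def in_C0(W, tol=1e-12):
    """Positive zero-error capacity: two inputs are never confusable, i.e.
    sum_y W(y|x1) W(y|x2) = 0 for some pair x1 != x2."""
    G = W @ W.T                                   # pairwise row overlaps
    off = G[~np.eye(W.shape[0], dtype=bool)]
    return bool((off < tol).any())

q = 4
noiseless = np.eye(q)                                            # Fig. 3.2a
useless = np.full((q, q), 1 / q)                                 # Fig. 3.2b
typewriter = 0.5 * (np.eye(q) + np.roll(np.eye(q), 1, axis=1))   # Fig. 3.2c
for name, V in [("noiseless", noiseless), ("zero-capacity", useless),
                ("typewriter", typewriter)]:
    print(f"{name:13s} in C0: {in_C0(V)}  in C: {in_C(V)}  "
          f"in C0*: {in_C0(V) or in_C(V)}")
```

All three channels land in $\mathcal{C}_0^*$: the noiseless and typewriter channels via $\mathcal{C}_0$, and the useless channel via $\mathcal{C}$.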

Fig. 3.2 Channels belonging to $\mathcal{C}_0^*$: (a) noiseless channel; (b) zero-capacity channel; (c) typewriter channel.

Now we state the following lemma without proof; it summarizes statements in [71].

Lemma 3.2.4. For a DMC $W : \mathcal{X} \to \mathcal{Y}$, the following statements are equivalent.

1. $W \notin \mathcal{C}_0$.
2. For all $x_1, x_2 \in \mathcal{X}$, $\sum_{y \in \mathcal{Y}} W(y|x_1)\, W(y|x_2) > 0$.

Basic algebraic structures

Now we introduce some basic algebraic structures that will be considered in this chapter.

Definition 3.2.5. Suppose $\mathcal{X}$ is a set and an operation $*$ is defined over $\mathcal{X}$. $(\mathcal{X}; *)$ forms a monoid if it satisfies the following three axioms.

1. For all $x_1, x_2 \in \mathcal{X}$, $x_1 * x_2 \in \mathcal{X}$.
2. For all $x_1, x_2, x_3 \in \mathcal{X}$, $(x_1 * x_2) * x_3 = x_1 * (x_2 * x_3)$.
3. There exists an element $x_0 \in \mathcal{X}$ such that for every element $x \in \mathcal{X}$, $x * x_0 = x_0 * x = x$. The element $x_0$ is also referred to as the neutral element of $(\mathcal{X}; *)$.

In short, a monoid is a single-operation algebraic structure satisfying closure, associativity, and the existence of an identity element.

Definition 3.2.6. A group is a monoid in which every element has an inverse.
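These axioms are directly machine-checkable for small operation tables. The following sketch (ours; the predicates `is_monoid` and `is_group` are hypothetical names) verifies them by brute force and reproduces the multiplication-modulo-$q$ example discussed next.

```python
from itertools import product

def is_monoid(X, op):
    closed = all(op(a, b) in X for a, b in product(X, repeat=2))
    assoc = all(op(op(a, b), c) == op(a, op(b, c))
                for a, b, c in product(X, repeat=3))
    has_id = any(all(op(e, x) == x == op(x, e) for x in X) for e in X)
    return closed and assoc and has_id

def is_group(X, op):
    if not is_monoid(X, op):
        return False
    e = next(e for e in X if all(op(e, x) == x == op(x, e) for x in X))
    return all(any(op(x, y) == e for y in X) for x in X)

mul6 = lambda a, b: (a * b) % 6
mul5 = lambda a, b: (a * b) % 5
print(is_monoid(set(range(6)), mul6), is_group(set(range(6)), mul6))  # True False
print(is_group(set(range(1, 5)), mul5))  # True: q = 5 is prime, 0 removed
```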

For example, the set $\{0, 1, \ldots, q-1\}$ with multiplication modulo $q$ forms a monoid for all $q$, but forms a group under multiplication modulo $q$ only if $q$ is prime and $0$ is removed from the set.

Definition 3.2.7. Let $(\mathcal{X}; *)$ be any algebraic structure and let $\mathcal{X}_s$ be a proper subset of $\mathcal{X}$. If $(\mathcal{X}_s; *)$ forms a group, we call $(\mathcal{X}_s; *)$ a subgroup of $(\mathcal{X}; *)$ and denote this relation by $(\mathcal{X}_s; *) \le (\mathcal{X}; *)$. Note that our definition allows a monoid to have a subgroup.

Definition 3.2.8. Given $(\mathcal{X}_s; *) \le (\mathcal{X}; *)$, for any $x \in \mathcal{X}$, $x * \mathcal{X}_s = \{x * x' \mid x' \in \mathcal{X}_s\}$ is called the left coset of $\mathcal{X}_s$ in $\mathcal{X}$ with respect to $x$, and $\mathcal{X}_s * x = \{x' * x \mid x' \in \mathcal{X}_s\}$ is called the right coset of $\mathcal{X}_s$ in $\mathcal{X}$ with respect to $x$.

According to Lagrange's theorem, the left cosets of a subgroup in a group partition the group, and all cosets have the same cardinality. The left cosets of a subgroup in a monoid partition the monoid as well, but their cardinalities can differ.

3.3 Entropy inequalities for arbitrary-input DMCs

In this section, we consider the scenario illustrated in Fig. 3.1, where $U_1, U_2, X_1, X_2$ are defined over a finite set $\mathcal{X}$, and the operation $*$ is defined over $\mathcal{X}$ so that $(\mathcal{X}; *)$ forms a monoid. We assume that $U_1$ and $U_2$ are i.i.d. random variables equiprobable on the set $\mathcal{X}$, and that for all $y \in \mathcal{Y}$ there exists $x \in \mathcal{X}$ such that $W(y|x) > 0$. We first derive a closed-form expression characterizing the difference between the mutual information of the virtual channels and that of the original channel after one step of channel transformation.

Lemma 3.3.1. Given a DMC $W : \mathcal{X} \to \mathcal{Y}$, we have

$$I(W^+) - I(W) = I(X_1; Y_1 \mid U_1, Y_2). \qquad (3.7)$$

Proof. See Section 3.5.2.

The rest of the chapter is devoted to finding a necessary and sufficient condition for $I(X_1; Y_1 \mid U_1, Y_2) > 0$. We first give a sufficient condition.

Lemma 3.3.2. Given a DMC $W : \mathcal{X} \to \mathcal{Y}$, if $W \notin \mathcal{C}_0^*$, then we have

$$I(X_1; Y_1 \mid U_1, Y_2) > 0. \qquad (3.8)$$

Proof. See Section 3.5.3.

Note that Lemma 3.3.2 provides a sufficient but not necessary condition for $I(X_1; Y_1 \mid U_1, Y_2) > 0$. Building on Lemma 3.3.2, we will obtain a necessary and sufficient condition for $I(X_1; Y_1 \mid U_1, Y_2) > 0$, which will be stated in Theorem 3.4.2.

Lemma 3.3.3. Given a DMC $W : \mathcal{X} \to \mathcal{Y}$, we have

$$I(W^-) + I(W^+) \le 2I(W). \qquad (3.9)$$

Proof. See Section 3.5.4.

Equality holds in Eq. (3.9) if $(\mathcal{X}; *)$ forms a group; that is, the channel transformation preserves the overall symmetric mutual information when $(\mathcal{X}; *)$ forms a group. Necessary and sufficient conditions for equality in Eq. (3.9) are studied in [51].

Based on Lemmas 3.3.1 and 3.3.2, together with Lemma 3.3.3, we can prove the main result of this chapter, which is the following theorem.

Theorem 3.3.4. For a DMC $W : \mathcal{X} \to \mathcal{Y}$ with $W \notin \mathcal{C}_0^*$,

$$I(W) - I(W^-) > 0, \qquad (3.10)$$
$$I(W^+) - I(W) > 0. \qquad (3.11)$$

The proof of Theorem 3.3.4 is straightforward. First, Eq. (3.11) is a direct consequence of Lemmas 3.3.1 and 3.3.2. Then, Eq. (3.10) is a direct consequence of Eq. (3.11) and Lemma 3.3.3. Moreover, Lemma 3.3.3 will be used in the proof of Lemma 3.3.5 (see Section 3.5.5). Theorem 3.3.4 generalizes the results in [66], where Eq. (3.10) and Eq. (3.11) are proved for prime-input DMCs only. We will show how Theorem 3.3.4 leads to a proof of channel polarization in the next section.

In order to make arguments about entropy inequalities of virtual channels over multiple channel transformation steps, we investigate whether the virtual channels $W^-$ and $W^+$ inherit the zero zero-error-capacity property of the original channel $W$.

Lemma 3.3.5. Consider a DMC $W : \mathcal{X} \to \mathcal{Y}$. If $W \notin \mathcal{C}_0^*$ and $(\mathcal{X}; *)$ forms a group, then we have

$$W^+ \notin \mathcal{C}_0^*, \qquad (3.12)$$
$$W^- \notin \mathcal{C}_0^*. \qquad (3.13)$$

Moreover, if $W \notin \mathcal{C}_0$ and $(\mathcal{X}; *)$ only forms a monoid, we have

$$W^+ \notin \mathcal{C}_0, \qquad (3.14)$$
$$W^- \notin \mathcal{C}_0. \qquad (3.15)$$

Proof. See Section 3.5.5.

3.4 Channel polarization for arbitrary-input DMCs

In the previous section, we proved that the symmetric mutual information of the virtual channels is strictly different from that of the original channel after one step of channel transformation. A natural step forward is to investigate whether the symmetric mutual information of the virtual channels converges asymptotically and, if so, the set of possible values to which it converges.

Channel polarization over groups

We first consider the case in which $(\mathcal{X}; *)$ forms a group. The corresponding proposition of Chapter 2 still holds for arbitrary-input DMCs; that is, the random process $I_n$ is a bounded martingale, so $I_n$ converges almost everywhere to a random variable $I_\infty$, and

$$\lim_{n \to \infty} \mathbb{E}\big[\,|I_{n+1} - I_n|\,\big] = 0. \qquad (3.16)$$

Since

$$\mathbb{E}\big[\,|I_{n+1} - I_n|\,\big] = \frac{1}{2^{n+1}} \sum_{i=1}^{2^n} \Big\{ \big| I\big(W_{2^{n+1}}^{(2i-1)}\big) - I\big(W_{2^n}^{(i)}\big) \big| + \big| I\big(W_{2^{n+1}}^{(2i)}\big) - I\big(W_{2^n}^{(i)}\big) \big| \Big\}, \qquad (3.17)$$

we have that, as $n \to \infty$,

$$\big| I\big(W_{2^{n+1}}^{(2i-1)}\big) - I\big(W_{2^n}^{(i)}\big) \big| \to 0, \qquad (3.18)$$
$$\big| I\big(W_{2^{n+1}}^{(2i)}\big) - I\big(W_{2^n}^{(i)}\big) \big| \to 0. \qquad (3.19)$$

So Eq. (3.16) together with Theorem 3.3.4 implies that for any $W \notin \mathcal{C}_0^*$, the corresponding virtual channels converge to channels in $\mathcal{C}_0^*$ asymptotically.
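This convergence is easy to observe numerically for small block lengths. The sketch below (ours; it is brute-force and exponential in the number of steps, so only suitable for two or three levels) recursively synthesizes all $2^n$ virtual channels of a BSC under modulo-2 addition and prints their symmetric mutual information values, which already cluster toward 0 and 1.

```python
import numpy as np
from itertools import product

q, eps, steps = 2, 0.1, 3

def sym_info(V):
    P = V / q
    Py = P.sum(axis=0)
    I = 0.0
    for x, y in product(range(V.shape[0]), range(V.shape[1])):
        if P[x, y] > 0:
            I += P[x, y] * np.log(P[x, y] / (Py[y] / q))
    return I / np.log(q)

def transform(V):
    ny = V.shape[1]
    Wm = np.zeros((q, ny * ny))
    Wp = np.zeros((q, ny * ny * q))
    for u1, u2, y1, y2 in product(range(q), range(q), range(ny), range(ny)):
        p = V[(u1 + u2) % q, y1] * V[u2, y2] / q
        Wm[u1, y1 * ny + y2] += p
        Wp[u2, (y1 * ny + y2) * q + u1] += p
    return Wm, Wp

level = [np.array([[1 - eps, eps], [eps, 1 - eps]])]   # BSC(0.1)
for _ in range(steps):
    level = [V for U in level for V in transform(U)]
print(sorted(round(sym_info(V), 4) for V in level))    # values drift toward 0 and 1
```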

As for the set of values to which the virtual channels converge, we need to investigate the set of channels that are invariant under the channel transformation, i.e., channels with $I(X_1; Y_1 \mid U_1, Y_2) = 0$.

Definition 3.4.1. Let $\mathcal{C}_{\mathrm{inv}}(\mathcal{X})$ denote the set of channels with input alphabet $\mathcal{X}$ such that $I(X_1; Y_1 \mid U_1, Y_2) = 0$ after one step of channel transformation.

It follows from Lemma 3.3.2 that $\mathcal{C}_{\mathrm{inv}}(\mathcal{X}) \subseteq \mathcal{C}_0^*$.

Theorem 3.4.2. Given a DMC $W : \mathcal{X} \to \mathcal{Y}$, a necessary and sufficient condition for $W \in \mathcal{C}_{\mathrm{inv}}(\mathcal{X})$ is that both of the following statements hold.

1. $W : \mathcal{X} \to \mathcal{Y}$ can be decomposed into $t \ge 1$ disjoint subchannels $W_i : \mathcal{X}_i \to \mathcal{Y}_i$, with $\mathcal{X}_i \subseteq \mathcal{X}$, $\mathcal{Y}_i \subseteq \mathcal{Y}$ and $W_i \in \mathcal{C}$ for all $i \in [t]$; and
2. there exists $\mathcal{X}_s \in \{\mathcal{X}_1, \ldots, \mathcal{X}_t\}$ such that $(\mathcal{X}_s; *) \le (\mathcal{X}; *)$ and every $\mathcal{X}_i \in \{\mathcal{X}_1, \ldots, \mathcal{X}_t\}$ is a left coset of $\mathcal{X}_s$.

Moreover, if $W \in \mathcal{C}_{\mathrm{inv}}$, then

$$W^- \in \mathcal{C}_{\mathrm{inv}}, \qquad (3.20)$$
$$W^+ \in \mathcal{C}_{\mathrm{inv}}. \qquad (3.21)$$

Proof. See Section 3.5.6.

The channels in Figs. 3.2a and 3.2b belong to $\mathcal{C}_{\mathrm{inv}}$. In particular, the channel of Fig. 3.2a can be decomposed into two zero-capacity subchannels, each with input alphabet size 1. With Eq. (3.16), Theorem 3.3.4, and Theorem 3.4.2, we can conclude that successive transformations of channels with zero-error capacity equal to zero give rise to channels converging asymptotically towards a set of channels $\mathcal{C}_{\mathrm{inv}}(\mathcal{X})$ with positive zero-error capacity or with zero capacity. Let $W_\infty$ denote the limit random variable of the random process $\{W_n; n \ge 0\}$ as defined previously by Eq. (2.43); we have $W_\infty = \lim_{n \to \infty} W_n \in \mathcal{C}_{\mathrm{inv}}$. Note that this does not conflict with Lemma 3.3.5, which states that after any finite number of channel transformation steps applied to a DMC $W \notin \mathcal{C}_0^*$, the corresponding bit-channels satisfy $W_{2^n}^{(i)} \notin \mathcal{C}_0^*$ for $i \in \{1, \ldots, 2^n\}$.

Now we investigate the value of the limit random variable $I_\infty$. Theorem 3.4.2 implies that $\mathcal{C}_{\mathrm{inv}}$ is a set of sum channels, every component channel of which has zero capacity. Moreover, for every $W \in \mathcal{C}_{\mathrm{inv}}$, the zero-error capacity of $W$ equals its capacity. The logic behind this is as follows: for every $W \in \mathcal{C}_{\mathrm{inv}}$ we have $C_0(W) \le C(W)$; the capacity of a sum channel gives $C(W) = \log\big(\sum_{i=1}^{t} 2^{C(W_i)}\big) = \log t$; and since the subchannels have disjoint output sets, choosing one input per subchannel yields $C_0(W) \ge \log t = C(W)$, where $t$ is the number of disjoint subchannels. Thus we can conclude that for every $W \in \mathcal{C}_{\mathrm{inv}}$, $C_0(W) = C(W) = \log t$.
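Theorem 3.4.2 can be illustrated with a concrete invariant channel. In the sketch below (our construction), the input group is $(\mathbb{Z}_4, +)$, the subgroup is $\mathcal{X}_s = \{0, 2\}$, and the channel reveals exactly the coset of its input while being useless within each coset; the one-step transformation then leaves the symmetric mutual information unchanged, i.e., $I(W^-) = I(W) = I(W^+)$.

```python
import numpy as np
from itertools import product

q = 4   # input group (Z_4, +); subgroup Xs = {0, 2} with cosets {0,2}, {1,3}
# Coset {0,2} -> outputs {0,1} uniformly; coset {1,3} -> outputs {2,3} uniformly.
# Each subchannel has zero capacity; the two cosets are perfectly distinguishable.
W = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]])

def sym_info(V):
    P = V / q; Py = P.sum(axis=0); I = 0.0
    for x, y in product(range(V.shape[0]), range(V.shape[1])):
        if P[x, y] > 0:
            I += P[x, y] * np.log(P[x, y] / (Py[y] / q))
    return I / np.log(q)

def transform(V):
    ny = V.shape[1]
    Wm = np.zeros((q, ny * ny)); Wp = np.zeros((q, ny * ny * q))
    for u1, u2, y1, y2 in product(range(q), range(q), range(ny), range(ny)):
        p = V[(u1 + u2) % q, y1] * V[u2, y2] / q
        Wm[u1, y1 * ny + y2] += p
        Wp[u2, (y1 * ny + y2) * q + u1] += p
    return Wm, Wp

Wm, Wp = transform(W)
print(sym_info(Wm), sym_info(W), sym_info(Wp))   # all equal log_4(2) = 0.5
```

This matches the set of limit values given in Theorem 3.4.3 next: the channel sits exactly at $I = \log(|\mathcal{X}|/|\mathcal{X}_s|)$ for the subgroup $\{0, 2\}$.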

Theorem 3.4.3. Given a DMC $W : \mathcal{X} \to \mathcal{Y}$, $W_\infty$ takes values in the set $\mathcal{C}_{\mathrm{inv}}(\mathcal{X})$, and $I_\infty$ takes values in

$$\Big\{ \log \frac{|\mathcal{X}|}{|\mathcal{X}_s|} \;\Big|\; (\mathcal{X}_s; *) \le (\mathcal{X}; *) \Big\}.$$

For example, given a channel $W : \mathcal{X} \to \mathcal{Y}$ with $|\mathcal{X}| = 6$, $I_\infty$ takes values in $\{\log 1, \log 2, \log 3, \log 6\}$. Theorem 3.4.3 is a direct consequence of Theorem 3.4.2, so we skip the proof.

Channel polarization over monoids

We now briefly discuss channels whose input alphabet is not a group, but only a monoid. The proof of Theorem 3.3.4 is still valid, but equality may not be achieved in Eq. (3.9). A consequence of this is that the random process $I_n$ is no longer a martingale, but a supermartingale. Moreover, $I(W^+) - I(W) = I(X_1; Y_1 \mid U_1, Y_2) = 0$ does not necessarily imply that $I(W^-) - I(W) = 0$; instead, $I(W^-) - I(W) = 0$ if $W \in \mathcal{C}$. The possible values of $W_\infty$ and $I_\infty$ are not known. Intuitively, one might expect that $W_\infty$ takes values in $\mathcal{C}$ and $I_\infty = 0$.

3.5 Proofs

3.5.1 Proof of Lemma 3.2.1

Since $X - Y - Z$ and $Y - X - Z$ both form Markov chains, we have

$$P_{Z|XY}(z|x, y) = P_{Z|X}(z|x) = P_{Z|Y}(z|y), \quad \text{if } P_{XY}(x, y) > 0. \qquad (3.22)$$

This completes the proof.

3.5.2 Proof of Lemma 3.3.1

According to the chain rule of entropy, we have

$$I(W^+) - I(W) = I(U_2; Y_1, Y_2, U_1) - I(U_2; Y_2) \qquad (3.23)$$
$$= H(U_2 | Y_2) - H(U_2 | Y_1, Y_2, U_1) \qquad (3.24)$$
$$= I(U_2; Y_1, U_1 \mid Y_2) \qquad (3.25)$$
$$= H(U_1, Y_1 \mid Y_2) - H(U_1, Y_1 \mid U_2, Y_2) \qquad (3.26)$$
$$= H(U_1 | Y_2) + H(Y_1 | U_1, Y_2) - H(U_1 | U_2, Y_2) - H(Y_1 | U_1, U_2, Y_2) \qquad (3.27)$$
$$= H(Y_1 | U_1, Y_2) - H(Y_1 | U_1, U_2, Y_2) \qquad (3.28)$$
$$= H(Y_1 | U_1, Y_2) - H(Y_1 | X_1) \qquad (3.29)$$
$$= H(Y_1 | U_1, Y_2) - H(Y_1 | X_1, U_1, Y_2) \qquad (3.30)$$
$$= I(X_1; Y_1 \mid U_1, Y_2). \qquad (3.31)$$

In particular, Eq. (3.28) follows from the fact that $U_1$ is independent of $(U_2, Y_2)$, so that $H(U_1|Y_2) = H(U_1|U_2, Y_2)$. Eqs. (3.29) and (3.30) follow from the fact that $Y_1 - X_1 - (U_1, Y_2)$ forms a Markov chain. This completes the proof.

3.5.3 Proof of Lemma 3.3.2

We prove the lemma by contradiction. Assume $I(X_1; Y_1 \mid U_1, Y_2) = 0$; we will show that this leads to the contradiction $W \in \mathcal{C}_0^*$. By this assumption, $X_1 - (U_1, Y_2) - Y_1$ forms a Markov chain. By construction (see Fig. 3.1), $(U_1, Y_2) - X_1 - Y_1$ forms a Markov chain too. Hence, we are in the scenario of Lemma 3.2.1. Note that $U_1$ and $Y_2$ are independent, so $P_{U_1, Y_2}(u, y_2) = P_{U_1}(u) P_{Y_2}(y_2)$. We have assumed that $U_1$ is uniformly distributed over $\mathcal{X}$ and that for all $y \in \mathcal{Y}$ there exists $x \in \mathcal{X}$ such that $W(y|x) > 0$. Then $P_{U_1, Y_2}(u, y_2) = \frac{1}{q} P_{Y_2}(y_2) > 0$ for all $u \in \mathcal{X}$ and $y_2 \in \mathcal{Y}$.

If there exist $u, y_2$ such that $P_{X_1 | U_1, Y_2}(x | u, y_2) > 0$ for all $x \in \mathcal{X}$, then by Lemma 3.2.1, for any $y$, $W(y|x)$ takes the same value for all $x \in \mathcal{X}$, and hence $I(W) = 0$, which contradicts the condition of the lemma. We can hence assume that, for all $u, y_2$, the set $\mathcal{X}_{u, y_2} = \{x \in \mathcal{X} \mid P_{X_1 | U_1, Y_2}(x | u, y_2) > 0\}$ is a proper subset of $\mathcal{X}$.

Consider the set $\mathcal{X}_{0, y}$ for some $y$, corresponding to $U_1 = 0$ and $Y_2 = y$, where $0$ is the neutral element of the monoid $(\mathcal{X}; *)$. Examining Fig. 3.1 for $U_1 = 0$, we observe that $X_1 = X_2 = U_2$, and hence the setup conditioned on $U_1 = 0$ is equivalent to the setup in Fig. 3.3. In this case, $X_1 = X_2$ is equiprobable on the set $\mathcal{X}$. From the figure, it is clear that $\mathcal{X}_{0, y}$ is non-empty, since otherwise the definition of a channel would be contradicted.

Fig. 3.3 Setup conditioned on $U_1 = 0$.

Since $\mathcal{X}_{0, y}$ is a non-empty proper subset of $\mathcal{X}$, its complement $\mathcal{X}_{0, y}^c = \mathcal{X} \setminus \mathcal{X}_{0, y}$ is also a non-empty proper subset of $\mathcal{X}$. Let $x$ and $\bar{x}$ be elements of $\mathcal{X}_{0, y}$ and $\mathcal{X}_{0, y}^c$, respectively. By the definition of $\mathcal{X}_{0, y}$, we know that $W(y|x) > 0$ and $W(y|\bar{x}) = 0$. Pick any $\tilde{y}$ such that $W(\tilde{y}|x) > 0$, and let us assume for now that $W(\tilde{y}|\bar{x}) > 0$ as well.
Then $W(\tilde{y}|x) > 0$ and $W(\tilde{y}|\bar{x}) > 0$ imply

$$P_{X_1 Y_2}(x, \tilde{y}) > 0, \quad \text{and} \qquad (3.32)$$
$$P_{X_1 Y_2}(\bar{x}, \tilde{y}) > 0, \qquad (3.33)$$

respectively. Lemma 3.2.1 with Eqs. (3.32) and (3.33) gives, for any $y_1$,

$$P_{Y_1 | X_1}(y_1 | x) = P_{Y_1 | X_1}(y_1 | \bar{x}), \qquad (3.34)$$

which is impossible by construction because $W(y|x) > 0$ and $W(y|\bar{x}) = 0$. Hence, our assumption that $W(\tilde{y}|\bar{x}) > 0$ leads to a contradiction, and we conclude that $W(\tilde{y}|\bar{x}) = 0$. Having shown that $W(\tilde{y}|\bar{x}) = 0$ for any $\tilde{y} \in \mathcal{Y}$ such that $W(\tilde{y}|x) > 0$, it follows that the inputs $x$ and $\bar{x}$ can be used to transmit one bit with zero probability of error over the channel, which contradicts the condition of the lemma. This completes the proof.

3.5.4 Proof of Lemma 3.3.3

We now prove that the overall mutual information is non-increasing under one step of channel transformation. According to the chain rule of mutual information and entropy, we have

$$I(W^-) + I(W^+) = I(U_1; Y_1, Y_2) + I(U_2; Y_1, Y_2, U_1) \qquad (3.35)$$
$$= I(U_1; Y_1, Y_2) + I(U_2; Y_1, Y_2 \mid U_1) \qquad (3.36)$$
$$= I(U_1, U_2; Y_1, Y_2) \qquad (3.37)$$
$$= I(X_1, X_2; Y_1, Y_2) \qquad (3.38)$$
$$= H(Y_1, Y_2) - H(Y_1, Y_2 \mid X_1, X_2) \qquad (3.39)$$
$$= H(Y_2) + H(Y_1 | Y_2) - H(Y_1 | X_1) - H(Y_2 | X_2) \qquad (3.40)$$
$$\le I(X_2; Y_2) + H(Y_1) - H(Y_1 | X_1) \qquad (3.41)$$
$$= 2I(W). \qquad (3.42)$$

Here Eq. (3.36) uses the independence of $U_1$ and $U_2$, and the inequality in Eq. (3.41) uses $H(Y_1|Y_2) \le H(Y_1)$. A sufficient but not necessary condition for equality in Eq. (3.41) is that $X_1$ and $X_2$ be independent of each other, as happens, e.g., when $(\mathcal{X}; *)$ forms a group. Characterizing the full range of operations that yield equality in Eq. (3.41) is an interesting problem that has been studied in [51].
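The chain (3.35)-(3.42) is straightforward to verify numerically in the group case. The sketch below (ours; the channel and all names are illustrative) builds the joint distribution of $(U_1, U_2, Y_1, Y_2)$ for a randomly drawn 4-ary channel under modulo-4 addition. Because the group operation makes $X_1$ and $X_2$ independent, and hence $Y_1$ and $Y_2$ independent, the inequality in Eq. (3.41) is tight and the sum rule holds with equality.

```python
import numpy as np
from itertools import product

q, ny = 4, 5
rng = np.random.default_rng(0)
W = rng.random((q, ny)); W /= W.sum(axis=1, keepdims=True)   # random DMC

# Joint P(u1, u2, y1, y2) with uniform U1, U2 and X1 = U1 + U2 mod q, X2 = U2
P = np.zeros((q, q, ny, ny))
for u1, u2, y1, y2 in product(range(q), range(q), range(ny), range(ny)):
    P[u1, u2, y1, y2] = W[(u1 + u2) % q, y1] * W[u2, y2] / q**2

def H(PA):
    """Entropy (base q) of a distribution given as an array of any shape."""
    p = PA.ravel(); p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(q))

PY1Y2 = P.sum(axis=(0, 1))
HY1 = H(PY1Y2.sum(axis=1))
HY1gY2 = H(PY1Y2) - H(PY1Y2.sum(axis=0))      # H(Y1|Y2) = H(Y1,Y2) - H(Y2)
print(HY1gY2, "==", HY1)                      # Y1 and Y2 are independent

I_sum = H(PY1Y2) + H(np.full(q * q, 1 / q**2)) - H(P)    # I(U1,U2; Y1,Y2)
PXY = W / q
I_W = H(PXY.sum(axis=0)) + H(PXY.sum(axis=1)) - H(PXY)   # I(W)
print(I_sum, "==", 2 * I_W)                   # sum rule (3.42) with equality
```

Replacing the group operation by a mere monoid operation can make $Y_1$ and $Y_2$ dependent, in which case the inequality in Eq. (3.41) typically becomes strict.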

3.5.5 Proof of Lemma 3.3.5

We first prove $W^+ \notin \mathcal{C}_0$. The transition probability of the channel $W^+ : \mathcal{X} \to \mathcal{Y}^2 \times \mathcal{X}$ is

$$W^+(y_1, y_2, u_1 \mid u_2) = P_{U_1}(u_1)\, W(y_1 \mid u_1 * u_2)\, W(y_2 \mid u_2). \qquad (3.43)$$

Then for any $u_2, u_2' \in \mathcal{X}$, we have

$$\sum_{y_1 \in \mathcal{Y}} \sum_{y_2 \in \mathcal{Y}} \sum_{u_1 \in \mathcal{X}} W^+(y_1, y_2, u_1 \mid u_2)\, W^+(y_1, y_2, u_1 \mid u_2') \qquad (3.44)$$
$$= \sum_{y_1 \in \mathcal{Y}} \sum_{y_2 \in \mathcal{Y}} \sum_{u_1 \in \mathcal{X}} \big(P_{U_1}(u_1)\big)^2\, W(y_2|u_2)\, W(y_2|u_2')\, W(y_1|u_1 * u_2)\, W(y_1|u_1 * u_2') \qquad (3.45)$$
$$= \underbrace{\sum_{y_2 \in \mathcal{Y}} W(y_2|u_2)\, W(y_2|u_2')}_{>0} \sum_{u_1 \in \mathcal{X}} \big(P_{U_1}(u_1)\big)^2 \underbrace{\sum_{y_1 \in \mathcal{Y}} W(y_1|u_1 * u_2)\, W(y_1|u_1 * u_2')}_{>0} \qquad (3.46)$$
$$> 0, \qquad (3.47)$$

where the two underbraced sums are positive by Lemma 3.2.4, since $W \notin \mathcal{C}_0$. Eq. (3.47), along with Lemma 3.2.4, implies $C_0(W^+) = 0$, that is to say,

$$W^+ \notin \mathcal{C}_0. \qquad (3.48)$$

Moreover, since

$$I(W^+) = I(U_2; Y_1, Y_2, U_1) \qquad (3.49)$$
$$= I(U_2; Y_2) + I(U_2; Y_1, U_1 \mid Y_2) \qquad (3.50)$$
$$\ge I(W) \qquad (3.51)$$
$$> 0, \qquad (3.52)$$

where Eq. (3.51) holds since $I(U_2; Y_2) = I(W)$ and mutual information is non-negative, and Eq. (3.52) holds since $W \notin \mathcal{C}$, we have

$$W^+ \notin \mathcal{C}. \qquad (3.53)$$

Based on Eqs. (3.48) and (3.53), we conclude that

$$W^+ \notin \mathcal{C}_0^*, \qquad (3.54)$$
which completes the first part of the proof.

The transition probability of the channel $W^- : \mathcal{X} \to \mathcal{Y}^2$ is

$$W^-(y_1, y_2 \mid u_1) = \sum_{u_2 \in \mathcal{X}} P_{U_2}(u_2)\, W(y_1 \mid u_1 * u_2)\, W(y_2 \mid u_2). \qquad (3.55)$$

Then for any $u_1, u_1' \in \mathcal{X}$, we have

$$\sum_{y_1 \in \mathcal{Y}} \sum_{y_2 \in \mathcal{Y}} W^-(y_1, y_2 \mid u_1)\, W^-(y_1, y_2 \mid u_1') \qquad (3.56)$$
$$= \sum_{y_1 \in \mathcal{Y}} \sum_{y_2 \in \mathcal{Y}} \Big( \sum_{u_2 \in \mathcal{X}} P_{U_2}(u_2)\, W(y_1|u_1 * u_2)\, W(y_2|u_2) \Big) \Big( \sum_{u_2' \in \mathcal{X}} P_{U_2}(u_2')\, W(y_1|u_1' * u_2')\, W(y_2|u_2') \Big) \qquad (3.57)$$
$$\ge \sum_{y_1 \in \mathcal{Y}} \sum_{y_2 \in \mathcal{Y}} P_{U_2}(u_2)\, W(y_1|u_1 * u_2)\, W(y_2|u_2)\, P_{U_2}(u_2')\, W(y_1|u_1' * u_2')\, W(y_2|u_2') \qquad (3.58)$$
$$= P_{U_2}(u_2)\, P_{U_2}(u_2') \underbrace{\sum_{y_2 \in \mathcal{Y}} W(y_2|u_2)\, W(y_2|u_2')}_{>0} \underbrace{\sum_{y_1 \in \mathcal{Y}} W(y_1|u_1 * u_2)\, W(y_1|u_1' * u_2')}_{>0} \qquad (3.59)$$
$$> 0, \qquad (3.60)$$

where Eq. (3.58) holds for any fixed $u_2, u_2' \in \mathcal{X}$ and follows from the fact that a sum of non-negative numbers is greater than or equal to any of its addends. Eq. (3.60), along with Lemma 3.2.4, implies that $C_0(W^-) = 0$, that is to say,

$$W^- \notin \mathcal{C}_0. \qquad (3.61)$$

Next we prove that $W^- \notin \mathcal{C}$ if $W \notin \mathcal{C}$, i.e., that $I(W^-) > 0$ if $I(W) > 0$. We will prove the equivalent proposition that $I(W^-) = 0$ implies $I(W) = 0$. Assuming $I(W^-) = 0$, consider the following series of equations:

$$I(W^+) - I(W) = I(X_1; Y_1 \mid U_1, Y_2) = I(W) - I(W^-) = I(W), \qquad (3.62)$$
$$I(X_1; Y_1 \mid U_1, Y_2) = I(X_1; Y_1), \qquad (3.63)$$
$$H(Y_1 \mid U_1, Y_2) - H(Y_1 \mid X_1) = H(Y_1) - H(Y_1 \mid X_1), \qquad (3.64)$$
$$H(Y_1 \mid U_1, Y_2) - H(Y_1) = 0, \qquad (3.65)$$
$$I(Y_1; U_1, Y_2) = 0, \qquad (3.66)$$
$$I(Y_1; U_1) + I(Y_1; Y_2 \mid U_1) = 0, \qquad (3.67)$$
$$I(Y_1; Y_2 \mid U_1) = 0, \qquad (3.68)$$
$$I(Y_1; Y_2 \mid U_1 = 0) = 0, \qquad (3.69)$$

where in the last step $0$ is the neutral element of the group. The second equality in Eq. (3.62) holds because $(\mathcal{X}; *)$ forms a group (see Lemma 3.3.3). The left-hand side of Eq. (3.64) uses the fact that $Y_1 - X_1 - (U_1, Y_2)$ forms a Markov chain. Eq. (3.68) follows from the non-negativity of mutual information.

We now examine the joint distribution of $(Y_1, Y_2)$ given $U_1 = 0$. All following arguments are conditioned on $U_1 = 0$, and we omit this conditioning for simplicity. Since $U_1 = 0$, we let $X = X_1 = X_2 = U_2$ be a uniform random variable on $\mathcal{X}$. Then the joint distribution satisfies

$$P_{Y_1 Y_2}(y_1, y_2) = \sum_x P_{Y_1 Y_2 | X}(y_1, y_2 \mid x)\, P_X(x) \qquad (3.70)$$
$$= \sum_x P_{Y_1|X}(y_1|x)\, P_{Y_2|X}(y_2|x)\, P_X(x), \qquad (3.71)$$

and the marginal distributions satisfy

$$P_{Y_1}(y_1)\, P_{Y_2}(y_2) = \Big( \sum_x P_{Y_1|X}(y_1|x)\, P_X(x) \Big) \Big( \sum_x P_{Y_2|X}(y_2|x)\, P_X(x) \Big). \qquad (3.72)$$

Since $I(Y_1; Y_2 \mid U_1 = 0) = 0$, $Y_1$ and $Y_2$ are independent (given $U_1 = 0$). We thus have

$$P_{Y_1 Y_2}(y_1, y_2) = P_{Y_1}(y_1)\, P_{Y_2}(y_2), \qquad (3.73)$$
$$\sum_x P_X(x)\, P_{Y_1|X}(y_1|x)\, P_{Y_2|X}(y_2|x) = \Big( \sum_x P_X(x)\, P_{Y_1|X}(y_1|x) \Big) \Big( \sum_x P_X(x)\, P_{Y_2|X}(y_2|x) \Big), \qquad (3.74)$$
$$\frac{1}{q} \sum_x P_{Y_1|X}(y_1|x)\, P_{Y_2|X}(y_2|x) = \frac{1}{q^2} \Big( \sum_x P_{Y_1|X}(y_1|x) \Big) \Big( \sum_x P_{Y_2|X}(y_2|x) \Big), \qquad (3.75)$$

for all $y_1, y_2 \in \mathcal{Y}$. Setting $y_1 = y_2 = y$ and noting that $P_{Y_1|X}$ and $P_{Y_2|X}$ are both the transition probability of the original channel $W$, denoted by $P_{Y|X}$, we obtain

$$\frac{1}{q} \sum_x \big( P_{Y|X}(y|x) \big)^2 = \Big( \frac{1}{q} \sum_x P_{Y|X}(y|x) \Big)^2. \qquad (3.76)$$

According to Jensen's inequality, equality here is achieved if and only if all terms are equal, i.e., for each $y \in \mathcal{Y}$, $P_{Y|X}(y|x) = c$ for all $x \in \mathcal{X}$, where $c$ is a constant depending on $y$. Thus for each $y \in \mathcal{Y}$, $P_{X|Y}(x|y) = \frac{1}{q}$ for all $x \in \mathcal{X}$. Then $I(W)$ must satisfy

$$I(W) = H(X) - H(X|Y) \qquad (3.77)$$
$$= \log q - \sum_y P_Y(y)\, H(X \mid Y = y) \qquad (3.78)$$
$$= \log q - \log q \sum_y P_Y(y) \qquad (3.79)$$
$$= 0. \qquad (3.80)$$

We have shown that $I(W^-) = 0$ implies $I(W) = 0$; equivalently, if $W \notin \mathcal{C}$, then

$$W^- \notin \mathcal{C}. \qquad (3.81)$$

Based on Eqs. (3.61) and (3.81), we conclude that

$$W^- \notin \mathcal{C}_0^*. \qquad (3.82)$$

Furthermore, we notice that the proofs of Eq. (3.48) and Eq. (3.61) do not require the inverse property of a group. Thus if $W \notin \mathcal{C}_0$ and $(\mathcal{X}; *)$ forms a monoid, we have

$$W^- \notin \mathcal{C}_0, \qquad (3.83)$$
$$W^+ \notin \mathcal{C}_0. \qquad (3.84)$$

This completes the proof.

3.5.6 Proof of Theorem 3.4.2

We first prove the necessary condition for $W \in \mathcal{C}_{\mathrm{inv}}$. This is a stronger result than what was proved in Lemma 3.3.2, and we follow the idea of the proof of Lemma 3.3.2. Let $\mathcal{X}_{u, y_2} = \{x \in \mathcal{X} \mid P_{X_1 | U_1, Y_2}(x | u, y_2) > 0\}$ and $\mathcal{Y}_{u, y_2} = \{y \in \mathcal{Y} \mid P_{Y_1 | U_1, Y_2}(y | u, y_2) > 0\}$. Assume $W \in \mathcal{C}_{\mathrm{inv}}$, i.e., $I(X_1; Y_1 \mid U_1, Y_2) = 0$. According to the proof of Lemma 3.3.2, we have the following two cases.
Case 1: There exist $u, y_2$ such that $\mathcal{X}_{u, y_2} = \mathcal{X}$. Then $W \in \mathcal{C}$ and both conditions are fulfilled. Fig. 3.4 illustrates the channel described by Case 1, where lines with the same color represent the same transition probability.

Fig. 3.4 Channel described by Case 1.

Case 2: For all $u, y_2$, the set $\mathcal{X}_{u, y_2}$ is a proper subset of $\mathcal{X}$. Examining Fig. 3.1 for $U_1 = 0$, where $0$ is the neutral element of $(\mathcal{X}; *)$, we observe that $X_1 = X_2 = U_2$, and hence the setup conditioned on $U_1 = 0$ is equivalent to the setup in Fig. 3.3. Fig. 3.5 illustrates the channel described by Case 2.

We first prove Condition 1. According to the proof of Lemma 3.3.2, if $x_0 \in \mathcal{X}_{0, y}$ and $x_1 \in \mathcal{X}_{0, y}^c$, then

$$\sum_{y \in \mathcal{Y}} W(y|x_0)\, W(y|x_1) = 0. \qquad (3.85)$$

We claim that for all $y_i, y_j \in \mathcal{Y}$,

$$\mathcal{X}_{0, y_i} = \mathcal{X}_{0, y_j} \quad \text{or} \quad \mathcal{X}_{0, y_i} \cap \mathcal{X}_{0, y_j} = \emptyset. \qquad (3.86)$$

This can be seen via a proof by contradiction. Assuming the contrary, we can find $x_0 \in \mathcal{X}_{0, y_i} \cap \mathcal{X}_{0, y_j}$ and $x_1 \in \mathcal{X}_{0, y_i} \cap \mathcal{X}_{0, y_j}^c$ such that

$$W(y|x_0) = W(y|x_1), \qquad (3.87)$$
$$\sum_{y \in \mathcal{Y}} W(y|x_0)\, W(y|x_1) = 0. \qquad (3.88)$$
