Introduction to Convolutional Codes, Part 1
Frans M.J. Willems, Eindhoven University of Technology
September 29, 2009
Outline: Elias, Father of Coding Theory · Textbook Encoder · Encoder Properties · Systematic Codes and Different Rates · Transmission
Peter Elias (U.S., 1923-2001)

Figure: P. Elias.

1955 - Coding for Noisy Channels. "Hamming had already introduced parity-check codes, but Peter went a giant step farther by showing for the binary symmetric channel that such linear codes suffice to exploit a channel to its fullest. In particular, he showed that error probability as a function of delay is bounded above and below by exponentials, whose exponents agree for a considerable range of values of the channel and the code parameters, and that these same results apply to linear codes. These exponential error bounds presaged those obtained for general channels ten years later by Gallager. In this same paper Peter introduced and named convolutional codes. His motivation was to show that it was in principle possible, by using a convolutional code with infinite constraint length, to transmit information at a rate equal to channel capacity with probability one that no decoded symbol will be in error." (by J.L. Massey)
Textbook Convolutional Encoder

We consider the encoder that appears in almost every elementary text on convolutional codes. It consists of two connected delay elements (a shift register) and two modulo-2 adders (EXORs). The output of a delay element at time t is equal to its input at time t − 1. The time t is an integer.

Figure: encoder with input u(t), two delay elements (D) holding s_1(t) and s_2(t), and outputs v_1(t) and v_2(t).

The motivation is that shift registers can be used to produce random sequences, and we have learned from Shannon that random codes reach capacity.
Finite-State Description

Let T be the number of symbols that is to be encoded. For the input u(t) of the encoder we assume that

u(t) ∈ {0, 1}, for t = 1, 2, ..., T, and u(T+1) = u(T+2) = 0.  (1)

For the outputs v_1(t) and v_2(t) of the encoder we then have that

v_1(t) = u(t) ⊕ s_2(t),
v_2(t) = u(t) ⊕ s_1(t) ⊕ s_2(t), for t = 1, 2, ..., T+2,  (2)

while the states s_1(t) and s_2(t) satisfy

s_1(1) = s_2(1) = 0,
s_1(t) = u(t−1), s_2(t) = s_1(t−1), for t = 2, 3, ..., T+3.  (3)

Note that the encoder starts and stops in the all-zero state (s_1, s_2) = (0, 0).
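The finite-state description above translates directly into code. A minimal sketch (function name `encode` is ad hoc; Python lists are 0-based where the slides index from 1):

```python
# A sketch of the textbook rate-1/2 encoder, implementing equations
# (1)-(3): two tail zeros drive the shift register back to state (0, 0).

def encode(u):
    """Encode input bits u: v1(t) = u(t) xor s2(t),
    v2(t) = u(t) xor s1(t) xor s2(t)."""
    s1, s2 = 0, 0                 # encoder starts in the all-zero state
    out = []
    for ut in list(u) + [0, 0]:   # append the two tail zeros u(T+1), u(T+2)
        v1 = ut ^ s2
        v2 = ut ^ s1 ^ s2
        out.extend([v1, v2])
        s1, s2 = ut, s1           # shift: s1(t+1) = u(t), s2(t+1) = s1(t)
    return out

# T = 6 input bits give a codeword of length 2(T + 2) = 16.
print(encode([0, 1, 0, 0, 1, 0]))
# → [0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
```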
Rate, Memory, Constraint Length

Codewords are now created as follows:

inputword = u(1), u(2), ..., u(T),
codeword = v_1(1), v_2(1), v_1(2), v_2(2), ..., v_1(T+2), v_2(T+2).  (4)

The length of the inputwords is T, hence there are 2^T codewords. We assume that they all have probability 2^{−T}, i.e. U(1), U(2), ..., U(T) are all uniform. The length of the codewords is 2(T+2), therefore the rate

R_T = log(2^T) / (2(T+2)) = T / (2(T+2)) = 1/2 − 1/(T+2).  (5)

Note that lim_{T→∞} R_T = 1/2. We therefore call our encoder a rate-1/2 encoder.

The memory M associated with our code is 2. In general the encoder uses the past inputs u(t−M), ..., u(t−1) and the current input u(t) to construct the outputs v_1(t), v_2(t). Related to this is the constraint length K = M + 1, since a new pair of outputs is determined by the M previous input symbols and the current one.
Convolution, Linearity

Why do we call this a convolutional encoder? To see why, note that s_1(t) = u(t−1) and s_2(t) = s_1(t−1) = u(t−2). Therefore

v_1(t) = u(t) ⊕ s_2(t) = u(t) ⊕ u(t−2),
v_2(t) = u(t) ⊕ s_1(t) ⊕ s_2(t) = u(t) ⊕ u(t−1) ⊕ u(t−2).  (6)

Define the impulse responses h_1(t) and h_2(t) with coefficients

h_1(0), h_1(1), h_1(2) = 1, 0, 1,
h_2(0), h_2(1), h_2(2) = 1, 1, 1,  (7)

and the other coefficients equal to zero. Then

v_1(t) = Σ_{τ=0,1,...} u(t−τ) h_1(τ) = (u ∗ h_1)(t),
v_2(t) = Σ_{τ=0,1,...} u(t−τ) h_2(τ) = (u ∗ h_2)(t),  (8)

i.e. the outputs result from convolving u with h_1 and h_2. This makes our convolutional code linear.
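Equation (8) can be checked numerically. A sketch (function name `conv2` is ad hoc) that convolves an input with h_1 = (1, 0, 1) and h_2 = (1, 1, 1) modulo 2:

```python
# A sketch verifying that the encoder outputs are mod-2 convolutions of
# the input with the impulse responses h1 = (1,0,1) and h2 = (1,1,1).

def conv2(u, h):
    """Mod-2 convolution of input bits u with impulse response h."""
    n = len(u) + len(h) - 1
    out = [0] * n
    for t in range(n):
        for tau, htau in enumerate(h):
            if 0 <= t - tau < len(u):
                out[t] ^= u[t - tau] & htau   # u(t-tau) h(tau), summed mod 2
    return out

u = [1, 0, 1, 1]
v1 = conv2(u, [1, 0, 1])   # v1(t) = u(t) + u(t-2)          (mod 2)
v2 = conv2(u, [1, 1, 1])   # v2(t) = u(t) + u(t-1) + u(t-2) (mod 2)
print(v1)  # → [1, 0, 0, 1, 1, 1]
print(v2)  # → [1, 1, 0, 0, 0, 1]
```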
A Systematic Convolutional Code

Figure: encoder with input u(t), three delay elements (D), output v_1(t) = u(t) and output v_2(t).

The encoder in the figure above is systematic since one of its outputs is equal to the input, i.e. v_1(t) = u(t). The rate R of this code is 1/2, its memory M = 3.
A Rate-2/3 Convolutional Code

Figure: encoder with inputs u_1(t) and u_2(t), two delay elements (D), and outputs v_1(t), v_2(t), v_3(t).

For every k = 2 binary input symbols the encoder in the figure above produces n = 3 binary output symbols. Therefore its rate R = k/n = 2/3. The memory M of this encoder is 1, since only u_1(t−1) and u_2(t−1) are used to produce v_1(t), v_2(t), and v_3(t).
Transmission via a BSC

Suppose that we use our example encoder and take the length T of the inputwords u = (u(1), u(2), ..., u(T)) equal to 6. We then get codewords x = (x_1, x_2, ..., x_{2(T+2)}) = (v_1(1), v_2(1), v_1(2), ..., v_2(T+2)) with codeword length equal to 2(T+2) = 16. We assume that all codewords are equiprobable. These codewords are transmitted over a binary symmetric channel (BSC), see the figure below, with cross-over probability 0 ≤ p ≤ 1/2.

Figure: BSC with input x ∈ {0, 1} and output y ∈ {0, 1}; each input is received correctly with probability 1 − p and flipped with probability p.

Now suppose that we receive

y = (y_1, y_2, ..., y_{2(T+2)}) = (10, 11, 00, 11, 10, 11, 11, 00).  (9)

How should we efficiently decode this received sequence?
Communicating a Message, Error Probability

Figure: message source → m → encoder e(m) → x → channel P(y|x) → y → decoder d(y) → m̂.

Consider the communication system in the figure. A message source produces message m ∈ M with a-priori probability Pr{M = m}. An encoder transforms the message into a channel input¹ x ∈ X, hence x = e(m). Now the channel output² y ∈ Y is received with probability Pr{Y = y | X = x}. The decoder observes y and produces an estimate m̂ ∈ M of the transmitted message, hence m̂ = d(y). How should the decoding rule d(·) be chosen such that the error probability

P_e = Pr{M̂ ≠ M}  (10)

is minimized?

¹ This is in general a sequence.
² Typically a sequence.
The Maximum A-Posteriori Probability (MAP) Rule

First we form an upper bound for the probability that no error occurs:

1 − P_e = Σ_y Pr{M = d(y), Y = y}
        = Σ_y Pr{Y = y} Pr{M = d(y) | Y = y}
        ≤ Σ_y Pr{Y = y} max_m Pr{M = m | Y = y}.  (11)

Observe that equality is achieved if and only if³

d(y) = arg max_m Pr{M = m | Y = y}, for all y that can occur.  (12)

Since {Pr{M = m | Y = y}, m ∈ M} are the a-posteriori probabilities after having received y, we call this rule the maximum a-posteriori probability rule (MAP-rule).

³ It is possible that the maximum is not obtained for a unique m.
The Maximum-Likelihood (ML) Rule

Suppose that all message probabilities are equal, i.e. Pr{M = m} = 1/|M|. Then

d(y) = arg max_m Pr{M = m | Y = y}
     = arg max_m Pr{M = m} Pr{Y = y | M = m} / Pr{Y = y}
     = arg max_m Pr{Y = y | M = m} / (|M| Pr{Y = y})
     = arg max_m Pr{Y = y | M = m}
     = arg max_m Pr{Y = y | X = e(m)}, for all y that can occur.  (13)

Since {Pr{Y = y | X = e(m)}, m ∈ M} are the likelihoods for receiving y, we call this rule the maximum-likelihood rule (ML-rule).
The Minimum-Distance (MD) Rule

Suppose that all message probabilities are equal, and that e(m) is a binary codeword of length L, for all m ∈ M. Moreover let y be the output sequence of a binary symmetric channel with cross-over probability 0 ≤ p ≤ 1/2, when e(m) is its input sequence. Then

Pr{Y = y | X = e(m)} = p^{d_H(e(m),y)} (1 − p)^{L − d_H(e(m),y)}
                     = (1 − p)^L (p/(1 − p))^{d_H(e(m),y)},  (14)

where d_H(·, ·) denotes Hamming distance. Since p/(1 − p) ≤ 1, the likelihood is non-increasing in the Hamming distance, hence

d(y) = arg max_m Pr{Y = y | X = e(m)}
     = arg min_m d_H(e(m), y), for all y that can occur.  (15)

Since {d_H(e(m), y), m ∈ M} are the Hamming distances between the codewords e(m) and the received sequence y, we call this rule the minimum (Hamming) distance rule (MD-rule).

The conclusion is that minimum Hamming distance decoding should be applied to decode y = (10, 11, 00, 11, 10, 11, 11, 00).
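The equivalence in (15) rests on the factor p/(1 − p) being below 1. A quick numerical sanity check (a sketch; the values p = 0.1 and L = 16 are arbitrary choices, not from the slides):

```python
# Sanity check: on a BSC with 0 < p < 1/2 the likelihood p^d (1-p)^(L-d)
# of equation (14) is strictly decreasing in the Hamming distance d, so
# maximizing the likelihood is the same as minimizing the distance.

p, L = 0.1, 16
likelihoods = [p ** d * (1 - p) ** (L - d) for d in range(L + 1)]
assert all(a > b for a, b in zip(likelihoods, likelihoods[1:]))
print("strictly decreasing in d:", likelihoods[0] > likelihoods[-1])
# → strictly decreasing in d: True
```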
Complexity of Decoding

We could do an exhaustive search: using minimum-distance (MD) decoding we could search all 2^T = 64 codewords. A serious disadvantage of this approach is that the search complexity increases exponentially in the number of input symbols T. We will therefore discuss an efficient method, called the Viterbi algorithm. The complexity of this method is linear in T.
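The exhaustive search is easy to sketch, which also makes the exponential cost visible (function names `encode` and `md_decode` are ad hoc):

```python
# A sketch of the exhaustive MD decoder: encode all 2^T = 64 inputwords
# with the textbook encoder and keep the one closest to y in Hamming
# distance. The loop body runs 2^T times, hence the exponential cost.
from itertools import product

def encode(u):
    """Textbook rate-1/2 encoder with two zero tail bits."""
    s1, s2, out = 0, 0, []
    for ut in list(u) + [0, 0]:
        out += [ut ^ s2, ut ^ s1 ^ s2]
        s1, s2 = ut, s1
    return out

def md_decode(y, T=6):
    best_u, best_d = None, None
    for u in product([0, 1], repeat=T):
        d = sum(a != b for a, b in zip(encode(u), y))
        if best_d is None or d < best_d:
            best_u, best_d = u, d
    return best_u, best_d

y = [1,0, 1,1, 0,0, 1,1, 1,0, 1,1, 1,1, 0,0]
print(md_decode(y))   # a closest inputword, at Hamming distance 4
```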
Transition Table

For our textbook encoder we can determine the transition table. This table contains the output pair v_1v_2(t) and next state s_1s_2(t+1) given the current state s_1s_2(t) and input u(t). See below. This table leads to the state diagram.

u(t)  s_1s_2(t)  v_1v_2(t)  s_1s_2(t+1)
 0      0,0        0,0         0,0
 1      0,0        1,1         1,0
 0      0,1        1,1         0,0
 1      0,1        0,0         1,0
 0      1,0        0,1         0,1
 1      1,0        1,0         1,1
 0      1,1        1,0         0,1
 1      1,1        0,1         1,1
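The table can be regenerated mechanically from the encoder equations; a small sketch (function name `transition` is ad hoc):

```python
# A sketch that regenerates the transition table from the encoder
# equations: outputs (v1, v2) = (u xor s2, u xor s1 xor s2), and the
# shift register makes the next state (s1, s2) <- (u, s1).

def transition(u, s1, s2):
    """Return ((v1, v2), (next s1, next s2)) for input u in state (s1, s2)."""
    return (u ^ s2, u ^ s1 ^ s2), (u, s1)

for s1 in (0, 1):
    for s2 in (0, 1):
        for u in (0, 1):
            (v1, v2), (n1, n2) = transition(u, s1, s2)
            print(f"u={u}  s1s2={s1}{s2}  v1v2={v1}{v2}  next={n1}{n2}")
```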
State Diagram

In the state diagram (see figure below) the states are denoted by s_1s_2(t). Along the branches that lead from the current state s_1s_2(t) to the next state s_1s_2(t+1) we find the input/outputs u(t)/v_1v_2(t).

Figure: state diagram with states 00, 01, 10, 11; branches 0/00 and 1/11 leave state 00, branches 0/11 and 1/00 leave state 01, branches 0/01 and 1/10 leave state 10, and branches 0/10 and 1/01 leave state 11.
Trellis Diagram

To see what sequences of states are possible as a function of the time t, we can take a look at the trellis (in Dutch: "hekwerk") diagram (see figure below). The horizontal axis is the time axis. States are denoted by s_1s_2(t), and along the branches are again the inputs and outputs u(t)/v_1v_2(t).

Figure: trellis diagram with the four states 00, 01, 10, 11 repeated at every time instant; each trellis section carries the same branch labels as the state diagram (0/00 and 1/11 from 00, 0/11 and 1/00 from 01, 0/01 and 1/10 from 10, 0/10 and 1/01 from 11).
Truncated Trellis

Each codeword now corresponds to a path in the trellis diagram. This path starts at t = 1 in state s_1s_2 = 00, traverses T + 2 branches, and then ends at t = T + 3 in state s_1s_2 = 00 again. The truncated trellis (see figure below) contains only the states and branches that can actually occur.

Figure: truncated trellis from START (state 00 at t = 1) to STOP (state 00 at t = T + 3); states and branches that cannot lie on a path from START to STOP are pruned at the beginning and the end.
Trellis with Branch Distances

After having received the channel output sequence y, to do MD-decoding we must be able to compute Hamming distances d_H(x, y), where x is a codeword. Since

d_H(x, y) = Σ_{t=1}^{T+2} d_H(v_1v_2(t), y(2t−1)y(2t)),  (16)

we first determine all branch distances d_H(v_1v_2(t), y(2t−1)y(2t)), see figure.

Figure: truncated trellis for y = 10 11 00 11 10 11 11 00, with every branch labeled by its output pair and by its Hamming distance to the corresponding received pair (e.g. branch output 00 against received pair 10 gives distance 1).
Viterbi's Principle

Let s = s_1s_2 and v = v_1v_2.

Figure: states s′(t) and s″(t) at time t, connected to state s(t+1) by branches with outputs v′(t) and v″(t).

Assume that state s(t+1) at time t+1 can be reached only via states s′(t) and s″(t) at time t, through the branches v′(t) and v″(t) respectively. Then (Viterbi [1967]):

best path to s(t+1) = best of (all paths to s′(t) extended by v′(t), all paths to s″(t) extended by v″(t))
                    = best of (best path to s′(t) extended by v′(t), best path to s″(t) extended by v″(t)).

This principle can be used recursively. First determine the best path leading from the start state s(1) to all states at time 2, then the best path to all states at time 3, etc., and finally determine the best path to the final state s(T+3).

Dijkstra's shortest-path algorithm [1959] is more general and more complex than the Viterbi method.
The Viterbi Algorithm

Define D_s(t) to be the total distance of a best path leading to state s at time t, and let B_s(t) denote a best path leading to this state. Define d_{s′,s}(t−1, t) to be the distance corresponding to the branch connecting state s′ at time t−1 to state s at time t, and let b_{s′,s}(t−1, t) denote this branch.

1. Set t := 1. Also set the total distance of the starting state D_00(1) := 0 and set the best path leading to it B_00(1) := φ, i.e. equal to the empty path.

2. Increment t, i.e. t := t + 1. For all possible states s at time t, let A_s(t) be the set of states at time t−1 that have a branch leading to state s at time t. Assume that s′ ∈ A_s(t) minimizes D_{s′}(t−1) + d_{s′,s}(t−1, t), i.e. survives. Then set

D_s(t) := D_{s′}(t−1) + d_{s′,s}(t−1, t),
B_s(t) := B_{s′}(t−1) ∗ b_{s′,s}(t−1, t).

Here ∗ denotes concatenation.

3. If t = T + 3, output the best path B_00(t); otherwise go to step 2.
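The steps above can be sketched in code for the textbook encoder over a BSC, with Hamming branch distances and the traceback stored alongside each survivor (names `viterbi`, `surv` are ad hoc):

```python
# A sketch of the Viterbi algorithm: the add-compare-select recursion of
# step 2, run over the truncated trellis, with the tail bits forced to 0
# so that the path ends in state 00.

def viterbi(y, T=6):
    """Return (inputword, distance) of a best path through the trellis."""
    surv = {(0, 0): (0, [])}        # state (s1, s2) -> (D_s(t), input history)
    for t in range(T + 2):
        y1, y2 = y[2 * t], y[2 * t + 1]
        nxt = {}
        for (s1, s2), (dist, hist) in surv.items():
            inputs = (0, 1) if t < T else (0,)   # tail bits are forced to 0
            for u in inputs:
                v1, v2 = u ^ s2, u ^ s1 ^ s2     # branch outputs
                bd = (v1 != y1) + (v2 != y2)     # branch distance d_H
                cand = (dist + bd, hist + [u])   # add
                nstate = (u, s1)
                if nstate not in nxt or cand[0] < nxt[nstate][0]:
                    nxt[nstate] = cand           # compare and select
        surv = nxt
    dist, hist = surv[(0, 0)]                    # stop state is 00
    return hist[:T], dist

y = [1,0, 1,1, 0,0, 1,1, 1,0, 1,1, 1,1, 0,0]
print(viterbi(y))   # a best inputword, at Hamming distance 4
```

On ties (the asterisks in the next slide) this sketch keeps the survivor found first; any tie-breaking rule gives an equally good path.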
Forward: Add, Compare, and Select

Best-path metrics are denoted in the states. An asterisk denotes that each of the two incoming paths into a state can be chosen as survivor.

Figure: truncated trellis for y = 10 11 00 11 10 11 11 00, with the accumulated best-path metric written in each state; ties (asterisks) occur in several states, and the STOP state is reached with metric 4.

Note that there is a best path (codeword) at distance 4 from the received sequence.
Backward: Trace Back

Tracing back from the stop state results in the decoded path (codeword)

x = (00, 11, 01, 11, 11, 01, 11, 00).

The corresponding input sequence is (0, 1, 0, 0, 1, 0). An equally good input sequence would be (0, 0, 0, 1, 1, 0). The second and fourth input digits are therefore not so reliable...

Figure: the trellis with branch distances and state metrics as before, with the traced-back best path marked from STOP back to START.
Complexity

Fortunately the complexity of the Viterbi algorithm is linear in the codeword length T. At each time we have to add, compare, and select (ACS) in every state. The complexity is therefore also linear in the number of states at each time, which is 4 in our case. In general the number of states is 2^m, where m is the number of delay elements in the encoder. Therefore Viterbi decoding is in practice only possible (now) if m is not much higher than, say, 10, i.e. the number of states is not much more than 2^10 = 1024. Later we shall see that the code performance improves for increasing values of m.
Exercise

We transmit an information-word (x(1), x(2), x(3), x(4), x(5)) over an inter-symbol-interference (ISI) channel. This information-word is preceded and followed by zeroes, hence

x(t) = 0 for integer t ∉ {1, 2, 3, 4, 5},
x(t) ∈ {−1, +1} for t ∈ {1, 2, 3, 4, 5}.

All 32 information-words occur with equal probability. For the ISI channel at integer times t we have that

y(t) = x(t) + x(t−1) + n(t),

where the probability density function of the noise n(t) is given by

p(n) = (1/√(2π)) exp(−n²/2),

thus n(t) has a Gaussian density. The received sequence satisfies y(1) = +0.3, y(2) = +0.2, y(3) = +0.1, y(4) = −1.1, y(5) = +2.5 and y(6) = +0.5. Decode the information-word with a decoder that minimizes the word-error probability. Show first that the decoder should minimize the (squared) Euclidean distance.
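A sketch of the decoder structure the exercise asks for, not a worked solution: with equiprobable words and Gaussian noise, the ML rule minimizes the squared Euclidean distance Σ_t (y(t) − x(t) − x(t−1))², and a Viterbi search over the single-symbol state x(t−1) computes the minimizer. The name `isi_viterbi` is ad hoc, and the sign of y(4) = −1.1 is taken from the slide as printed.

```python
# A sketch: Viterbi decoding over the ISI channel y(t) = x(t) + x(t-1) + n(t),
# with the squared Euclidean branch cost (y(t) - x(t) - x(t-1))^2 replacing
# the Hamming branch distance of the BSC case.

def isi_viterbi(y):
    """Minimize sum of squared errors over x in {-1,+1}^5 (zeros outside)."""
    surv = {0: (0.0, [])}                        # state = previous symbol x(t-1)
    for t, yt in enumerate(y, start=1):
        symbols = (-1, +1) if t <= 5 else (0,)   # x(6) = 0 (trailing zero)
        nxt = {}
        for prev, (cost, hist) in surv.items():
            for x in symbols:
                c = cost + (yt - x - prev) ** 2  # squared Euclidean branch cost
                if x not in nxt or c < nxt[x][0]:
                    nxt[x] = (c, hist + [x])     # compare and select
        surv = nxt
    cost, hist = surv[0]                         # final state: x(6) = 0
    return hist[:5], cost

y = [0.3, 0.2, 0.1, -1.1, 2.5, 0.5]
print(isi_viterbi(y))
```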