Ch 0 Introduction. 0.1 Overview of Information Theory and Coding


Ch 0 Introduction

0.1 Overview of Information Theory and Coding

Overview
Information theory was founded by Shannon in 1948. The theory deals with transmission (communication systems) or recording (storage systems) over/in a channel. The channel can be a wireless or wired channel (communication: copper telephone lines or fiber-optic cables) or magnetic or optical disks (storage).
There are three aspects that need to be considered: compression, error detection and correction, and cryptography.
Information theory is based on probability theory.
A communication or compression chain includes:
[Block diagram] Sent messages: source -> source coding (encoder) -> channel coding -> symbols -> channel -> channel decoding -> source decoding (decoder) -> received messages (receiver). Compression/decompression is tied to the source entropy; error detection and correction is tied to the channel capacity.

Digital Communication and Storage Systems
A basic information processing system consists of a source, an encoder, a channel, a decoder, and a destination.
Channel: produces a received signal r which differs from the original signal c (the channel introduces noise, channel distortion, etc.). Thus, the decoder can only produce an estimate m' of the original message m.
Goal of processing: information conveyed through (or stored in) the channel must be reproduced at the destination as reliably as possible. At the same time, the system should allow the transmission of as much information as possible per unit time (communication system) or per unit of storage (storage system).

Information Source
The source message m consists of a time sequence of symbols emitted by the information source. The source can be:
- a continuous-time source, if the message is continuous in time, e.g., a speech waveform;
- a discrete-time source, if the message is discrete in time, e.g., data sequences from a computer.
The symbols emitted by the source can be:
- continuous in amplitude, e.g., a speech waveform;
- discrete in amplitude, e.g., text with a finite symbol alphabet.
This course is primarily concerned with discrete-time, discrete-amplitude (i.e., digital) sources, as practically all new communication and storage systems fall into this category.
Since information and coding theory depends on probability theory, we review it first.

0.2 Review of Random Variables and Probability

Probability
Let us consider a single experiment, such as the rolling of a die, with a number of possible outcomes. The sample space S of the experiment consists of the set of all possible outcomes. In the case of a die, S = {1, 2, 3, 4, 5, 6}, with the integer representing the number of dots on the six faces of the die.
Event: any subset A of the sample space S.
Complement: the complement of A, denoted A', consists of all sample points of S that are not in A.
Example 0.1: For S and A defined above, find A'.
Two events are said to be mutually exclusive if they have no sample points in common.
The union of two events A and B is the event A U B consisting of all sample points in A or B.
The intersection of two events A and B is the event A n B consisting of all sample points common to A and B.
Associated with each event A contained in S is its probability, denoted by P(A), with 0 <= P(A) <= 1 and P(S) = 1. For mutually exclusive events A_i (A_i n A_j = empty for i != j), the probability of the union is
P(U_i A_i) = Sum_i P(A_i).
Example 0.2: If A = {2, 4}, find P(A).

Joint Event and Joint Probability
Instead of dealing with a single experiment, let us perform two experiments and consider their outcomes. For example, the two experiments can be two separate tosses of a single die, or a single toss of two dice. The sample space S consists of the 36 two-tuples (i, j), where i, j = 1, ..., 6. Each point in the sample space is assigned the probability 1/36.
Let A_i, i = 1, ..., n, denote the outcomes of the first experiment, and B_j, j = 1, ..., m, the outcomes of the second experiment. Assuming that the outcomes B_j, j = 1, ..., m, are mutually exclusive and U_j B_j = S, it follows that
P(A_i) = Sum_j P(A_i, B_j).
If A_i, i = 1, ..., n, are mutually exclusive and U_i A_i = S, then
P(B_j) = Sum_i P(A_i, B_j).
In addition, Sum_i Sum_j P(A_i, B_j) = 1.

Conditional Probability
A joint event A n B occurs with the probability P(A, B), which can be expressed as
P(A, B) = P(A|B) P(B) = P(B|A) P(A),
where P(A|B) and P(B|A) are conditional probabilities.
Example 0.3: Let us assume that we toss a die. The events are A = {1, 2, 3} and B = {1, 3, 6}; find P(B|A).

A conditional probability is defined as P(A|B) = P(A, B) / P(B). Let A and B be two events in a single experiment:
- If these are mutually exclusive (A n B = empty), then P(A|B) = 0.
- If A is contained in B, then P(A|B) = P(A) / P(B).
Bayes' Theorem: If A_i, i = 1, ..., n, are mutually exclusive and U_{i=1..n} A_i = S, then
P(A_i|B) = P(B|A_i) P(A_i) / Sum_{j=1..n} P(B|A_j) P(A_j).
Statistical Independence: Let P(A|B) be the probability of occurrence of A given that B has occurred. Suppose that the occurrence of A does not depend on the occurrence of B. Then P(A|B) = P(A), and therefore P(A, B) = P(A) P(B).
Example 0.4: Two successive experiments in tossing a die:
A = {2, 4, 6}, P(A) = 1/2 (even-numbered sample points in the first toss);
B = {2, 4, 6}, P(B) = 1/2 (even-numbered sample points in the second toss).
Determine the probability of the joint event "even-numbered outcome on the first toss (A) and even-numbered outcome on the second toss (B)", P(A, B).

Random Variables
Consider a sample space S with elements s in S. A function X(s) that assigns a real number to each outcome s is a random variable.
Probability Mass Function (PMF): for a discrete random variable taking the values x_1, ..., x_M,
p_X(x_i) = P(X = x_i), with p_X(x) = 0 otherwise, and Sum_{i=1..M} p_X(x_i) = 1.
For example, for a fair die, p_X(x_i) = 1/6 for each of the six faces.
Definition: The mean of the random variable X is E(X) = Sum_i x_i p_X(x_i).
Example 0.5: S = {1, 2, 3, 4, 5, 6}, X(s) = s; find E(X).

Useful Distributions
Let X be a discrete random variable that has two possible values, say X = 1 or X = 0, with probabilities p and 1 - p, respectively. This is the Bernoulli distribution, and the PMF can be represented as in the figure [PMF plot: mass 1 - p at x = 0 and mass p at x = 1]. The mean of such a random variable is E(X) = p.
The performance of a fixed number of trials with a fixed probability of success on each trial is known as a Bernoulli trial.

Let X_i, i = 1, ..., n, be statistically independent and identically distributed random variables with a Bernoulli distribution, and let us define a new random variable Y = Sum_{i=1..n} X_i. This random variable takes values from 0 to n. The associated probabilities can be expressed as
P(Y = k) = (n choose k) p^k (1 - p)^(n-k), k = 0, 1, ..., n,
where (n choose k) = n! / (k! (n - k)!) is the binomial coefficient. This represents the probability of having k successes in n Bernoulli trials. The resulting probability mass function is the binomial distribution (see the website). The mean of a random variable with a binomial distribution is E(Y) = np.
Definitions:
1. The mean of a function g(X) of the random variable X is defined as E[g(X)] = Sum_i g(x_i) p_X(x_i).
2. The variance of the random variable X is defined as Var(X) = E[(X - E(X))^2] = E(X^2) - (E(X))^2.
Example: calculate the variance for the random variable defined in Example 0.5, whose mean is 21/6.
3. The variance of a function g(X) of the random variable X is defined as Var[g(X)] = E[(g(X) - E[g(X)])^2].
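A quick numerical check of these definitions may help. The following Python sketch (not part of the original notes; the binomial parameters n = 10, p = 0.3 are chosen only for illustration) computes the mean and variance of the die of Example 0.5 and verifies that the binomial PMF sums to one with mean np:

from math import comb

# Die of Example 0.5: X(s) = s with p_X(x) = 1/6 for x = 1..6
values = range(1, 7)
p = 1 / 6
mean = sum(x * p for x in values)               # E(X) = 21/6 = 3.5
var = sum((x - mean) ** 2 * p for x in values)  # Var(X) = E[(X - E(X))^2]
print(mean, var)                                # 3.5 and about 2.917

# Binomial Y = sum of n Bernoulli(p) trials (illustrative n and p)
n, pb = 10, 0.3
pmf = [comb(n, k) * pb**k * (1 - pb)**(n - k) for k in range(n + 1)]
print(sum(pmf))                                 # 1.0  (the PMF sums to one)
print(sum(k * pk for k, pk in enumerate(pmf)))  # E(Y) = n*p = 3.0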

Ch 1 Discrete Source and Entropy

1.1 Discrete Sources and Entropy

1.1.1 Source Alphabets and Entropy

Overview
Information theory is based on probability theory, as the term "information" carries with it a connotation of UNPREDICTABILITY (SURPRISE) in the transmitted signal.
The information source is defined by:
- the set of output symbols;
- the probability rules which govern the emission of these symbols.
Finite-discrete source: a source with a finite number of unique symbols. The symbol set is called the source alphabet.
Definition: A is a source alphabet with M possible symbols, A = {a_0, a_1, ..., a_{M-1}}. We can say that the emitted symbol is a random variable which takes values in A. The number of elements in a set is called its cardinality; e.g., the cardinality of A is |A| = M.
The source output symbols can be denoted as s_0, s_1, ..., s_t, ..., where s_t in A is the symbol emitted by the source at time t. Note that here t is an integer time index.
Stationary source: the set of probabilities is not a function of time. This means that, at any given time, the probability that the source emits a_m is p_m = Pr(a_m).
Probability mass function: P_A = {p_0, p_1, ..., p_{M-1}}. Since the source emits only members of its alphabet, Sum_{m=0}^{M-1} p_m = 1.

Information Sources Classification
Stationary versus non-stationary source: for a stationary source the set of probabilities is not a function of time, whereas for a non-stationary source it is.
Synchronous versus asynchronous source: a synchronous source emits a new symbol at a fixed time interval T_s, whereas for an asynchronous source the interval between emitted symbols is not fixed. The latter can be approximated as synchronous by defining a null character for the times when the source does not emit; we say the source emits a null character at time t.

Representation of the Source Symbols
The symbols emitted by the source must be represented somehow. In digital systems, the binary representation is used.
Pop Quiz: How many bits are required to represent the symbols 1, 2, 3, 4? Or, in general, the symbols 1, 2, 3, ..., n?
Answer: 2 bits for four symbols; in general, ceil(log2(n)) bits.
The symbols represented in this fashion are referred to as source data.

Distinction between Data and Information
For example, suppose an information source has an alphabet with only 1 symbol. The representation of this symbol is data, but this data is not information, as it is completely uninformative. Since information carries the connotation of uncertainty, the information content of this source is zero.
Question: How can one measure the information content of a source?
Answer: By its entropy, defined next.

Entropy of a Source
Example: pick a marble from a bag of 2 blue and 5 red marbles.
Probability of picking a red marble: p_red = 5/7.
Number of choices for each red picked: 1/p_red = 7/5 = 1.4.
Each transmitted symbol 1 is just one choice out of 1/p_1 many possible choices, and therefore symbol 1 contains log2(1/p_1) bits of information (since 1/p_1 = 2^(log2(1/p_1))). Similarly, symbol k contains log2(1/p_k) bits of information.
The average number of information bits per symbol for our source is the entropy; it is calculated as
H(A) = Sum_m p_m log2(1/p_m) = -Sum_m p_m log2(p_m).
Shannon gave this precise mathematical definition of the average amount of information conveyed per source symbol, used to measure the information content of a source.
Unit of measure (entropy): bits/symbol (for base-2 logarithms).
Range of entropy: 0 <= H(A) <= log2(M), where M is the cardinality of the source alphabet A; when p_m = 1/M, m = 1, ..., M (i.e., equal probabilities), H(A) takes its maximum value.
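As a small illustration (a sketch, not part of the original notes), the entropy of a discrete source can be computed directly from its probability distribution; the 4-ary distribution below is the one used in several later examples:

from math import log2

def entropy(probs):
    # Entropy in bits/symbol; terms with p = 0 contribute 0 by convention.
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.3, 0.15, 0.05]))  # about 1.648 bits/symbol
print(entropy([0.25] * 4))              # 2.0 bits/symbol: the uniform case reaches log2(M)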

Example 1.1: What is the entropy of a 4-ary source having symbol probabilities P_A = {0.5, 0.3, 0.15, 0.05}?

Example 1.2: If A = {0, 1} with probabilities P_A = {1 - p, p}, where 0 <= p <= 1, determine the range of H(A).

Example 1.3: For an M-ary source, what distribution of probabilities P(A) maximizes the information entropy H(A)?

Measurement of the Information Efficiency of the Source
The efficiency of the source is the ratio of the entropy of the source to the (average) number of binary digits used to represent the source data.
Example 1.4: A 4-ary source A = {00, 01, 10, 11} has symbol probabilities P_A = {0.5, 0.3, 0.15, 0.05}. What is the efficiency of the source?
When the entropy of the source is lower than the (average) number of bits used to represent the source data, an efficient coding scheme can be used to encode the source information using, on average, fewer binary digits. This is called data compression, and the encoder used for it is called a source encoder.

1.1.2 Joint and Conditional Entropy
If we have two information sources A and B, and we want to make a compound symbol C with c_ij = (a_i, b_j), find H(C).

i) If A and B are statistically independent: H(C) = H(A, B) = H(A) + H(B).
ii) If B depends on A: H(C) = H(A, B) = H(A) + H(B|A), where H(B|A) is the conditional entropy of B given A.
Example 1.5: We often use a parity bit for error detection. For a 4-ary information source A = {0, 1, 2, 3} with P_A = {0.25, 0.25, 0.25, 0.25}, and the parity generator B = {0, 1} with
b = 0 if a = 0 or 1, and b = 1 if a = 2 or 3,
find H(A), H(B) and H(A, B).

1.1.3 Entropy of Symbol Blocks and the Chain Rule
We want to find H(A_0, A_1, ..., A_{n-1}), where A_t (t = 0, 1, ..., n-1) is the symbol at time index t, drawn from alphabet A. The chain rule expands this joint entropy as
H(A_0, A_1, ..., A_{n-1}) = H(A_0) + H(A_1|A_0) + ... + H(A_{n-1}|A_{n-2}, ..., A_0).
Example 1.5: Suppose a memoryless source with A = {0, 1} having equal probabilities emits a sequence of 6 symbols. Following the 6th symbol, suppose a 7th symbol is transmitted which is the sum modulo 2 of the six previous symbols (this is just the exclusive-or of the symbols emitted by A). What is the entropy of the 7-symbol sequence?

Example 1.6: For an information source having an alphabet A with |A| symbols, what is the range of entropies possible?

1.2 Source Coding

1.2.1 Mapping Functions and Efficiency
For an inefficient information source, i.e., H(A) < log2(|A|), the communication system can be made more cost effective through source coding.
[Block diagram] Information source -> sequence s_0, s_1, ..., with s_t in A (source alphabet) -> source encoder -> code words s'_0, s'_1, ..., with s'_t in B (code alphabet).

In its simplest form, the encoder can be viewed as a mapping of the source alphabet A to a code alphabet B, i.e., C: A -> B. Since the encoded sequence must be decoded at the receiver end, the mapping function C must be invertible.
Goal of coding: average information bits/symbol ~ average bits used to represent a symbol (i.e., code efficiency ~ 1).
Example 1.7: Let A be a 4-ary source with symbol probabilities P_A = {0.5, 0.3, 0.15, 0.05}, and let C be an encoder which maps the symbols in A into strings of binary bits as follows:
p_0 = 0.5,  C(a_0) = 0
p_1 = 0.3,  C(a_1) = 10
p_2 = 0.15, C(a_2) = 110
p_3 = 0.05, C(a_3) = 111
Determine the average number of transmitted binary digits per code word and the efficiency of the encoder.
Example 1.8: Let C be an encoder grouping the symbols in A into ordered pairs <a_i, a_j>; the set of all possible pairs <a_i, a_j> is called the Cartesian product of the set A and is denoted A x A. Thus, the encoder is C: A x A -> B, or C(a_i, a_j) = b_m. Now let A be the 4-ary memoryless source with the symbol probabilities given in Example 1.7; determine the average number of transmitted binary digits per code word and the efficiency of the encoder. The code words are shown in the table following.

<a_i, a_j>   Pr<a_i, a_j>   b_m        <a_i, a_j>   Pr<a_i, a_j>   b_m
<a_0, a_0>   0.25                      <a_2, a_0>   0.075
<a_0, a_1>   0.15                      <a_2, a_1>   0.045
<a_0, a_2>   0.075                     <a_2, a_2>   0.0225
<a_0, a_3>   0.025                     <a_2, a_3>   0.0075
<a_1, a_0>   0.15                      <a_3, a_0>   0.025
<a_1, a_1>   0.09                      <a_3, a_1>   0.015
<a_1, a_2>   0.045                     <a_3, a_2>   0.0075
<a_1, a_3>   0.015                     <a_3, a_3>   0.0025
(The pair probabilities are the products p_i p_j of the individual symbol probabilities.)

1.2.2 Mutual Information
If we have a source set A and a code set B, what is the entropy relationship between them?
[Diagram of the sets A and B.]

i) [Diagram: source set A and code set B, with a single symbol a mapped to a single code word b (one-to-one mapping).]
ii) [Diagram: source set A and code set B, with two source symbols a_i and a_j mapped to the same code word b (many-to-one mapping).]

iii) [Diagram: source set A and code set B, with one source symbol a_i mapped to two code words b_i and b_j (one-to-many mapping).]

1.2.3 Data Compression
Why data compression? Whenever space is a concern, you would like to use data compression, for example when sending text files over a modem or the Internet: if the files are smaller, they will get to the destination faster.
All media, such as text, audio, graphics or video, contain redundancy. Compression attempts to eliminate this redundancy.
Example of redundancy: if the representation of a medium captures content that is not perceivable by humans, then removing such content will not affect the quality. For example, capturing audio frequencies outside the human hearing range can be avoided without any harm to the audio's quality.
[Diagram] Original message, A -> ENCODER -> compressed message, B -> DECODER -> decompressed message, A.
Lossless compression:
Lossy compression:

Lossless and lossy compression are terms that describe whether or not, in the compression of a message, all of the original data can be recovered when decompression is performed.
Lossless Compression
- Every single bit of data originally transmitted remains after decompression; all the information is completely restored.
- One can use lossless compression whenever space is a concern but the information must remain intact. In other words, when a file is compressed it takes up less space, but when it is decompressed it still contains the same information.
- The idea is to get rid of redundancy in the information.
- Standards: ZIP, GZIP, UNIX Compress, GIF.
Lossy Compression
- Certain information is permanently eliminated from the original message, especially redundant information.
- When the message is decompressed, only part of the original information is still there (although the user may not notice it).
- Lossy compression is generally used for video and sound, where a certain amount of information loss will not be detected by most users.
- Standards: JPEG (still images), MPEG (audio and video), MP3 (MPEG-1, Layer 3).

Lossless Compression
When we encode characters in computers, we assign each an 8-bit code based on the (extended) ASCII chart.
(Extended) ASCII: fixed 8 bits per character.
For example, for "hello there!", 12 characters x 8 bits = 96 bits are needed.
Question: Can one encode this message using fewer bits?
Answer: Yes. In general, in most files some characters appear more often than others, so it makes sense to assign shorter codes to characters that appear more often and longer codes to characters that appear less often. This is exactly what C. Shannon and R. M. Fano were thinking when they created the first compression algorithm in the late 1940s.

Kraft Inequality Theorem
Prefix Code (or Instantaneously Decodable Code): a code that has the property of being self-punctuating. Punctuating means dividing a string of symbols into words; a prefix code has punctuation built into its structure (rather than added using special punctuation symbols). It is designed so that no code word is a prefix of any other (longer) code word. It is also a data compression code.
To construct an instantaneously decodable code of minimum average length (for a source A, or a given random variable a with values drawn from the source alphabet), the Kraft inequality must be satisfied: for an instantaneously decodable code B for a source A, the code lengths {l_i} must satisfy
Sum_i 2^(-l_i) <= 1.
Conversely, if the code word lengths satisfy this inequality, then there exists an instantaneously decodable code with these word lengths.

Shannon-Fano Theorem
The Kraft inequality tells us when an instantaneously decodable code exists. But we are interested in finding the optimal code, i.e., the one that maximizes the efficiency, or minimizes the average code length L. The average code length of the code B for the source A (with a as a random variable with values drawn from the source alphabet with probabilities {p_i}) is minimized if the code lengths {l_i} are given by
l_i = -log2(p_i).
This quantity is called the Shannon information (pointwise).
Example 1.9: Consider the following random variable a, with the optimal code lengths given by the Shannon information. Calculate the average code length.

a:                   a_0   a_1   a_2   a_3
p_i (i = 0,...,3):   1/2   1/4   1/8   1/8
l_i (i = 0,...,3):   1     2     3     3

The average code length of the optimal code is L = Sum_i p_i l_i = 1.75 bits.
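A quick check of the Kraft inequality and of this average length for Example 1.9 (an illustrative sketch, not part of the original notes):

from math import log2

p = [1/2, 1/4, 1/8, 1/8]
l = [int(-log2(pi)) for pi in p]               # optimal lengths: [1, 2, 3, 3]

kraft = sum(2 ** (-li) for li in l)            # must be <= 1 for a prefix code
L_bar = sum(pi * li for pi, li in zip(p, l))   # average code length
H = -sum(pi * log2(pi) for pi in p)            # source entropy
print(kraft, L_bar, H)                         # 1.0 1.75 1.75 -- here L_bar equals H(A)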

Note that this is the same as the entropy of A, H(A).

Lower Bound on the Average Length
The observation about the relation between the entropy and the expected length of the optimal code can be generalized. Let B be an instantaneous code for the source A. Then the average code length is bounded by L >= H(A).

Upper Bound on the Average Length
Let B be a code with optimal code lengths, i.e., l_i = ceil(-log2(p_i)). Then the average length is bounded by L < H(A) + 1.
Why is the upper bound H(A) + 1 and not H(A)? Because sometimes the Shannon information gives us fractional lengths, and we have to round them up.
Example 1.10: Consider the following random variable a, with the optimal code lengths given by the Shannon information theorem. Determine the average code length bounds.

a:                   a_0   a_1   a_2   a_3   a_4
p_i (i = 0,...,4):
l_i (i = 0,...,4):

The entropy of the source A is computed as H(A) = -Sum_i p_i log2(p_i).
The source coding theorem tells us: H(A) <= L < H(A) + 1, where L is the code length of the optimal code.
Example 1.11: For the source in Example 1.10, the following code tries to match the optimal code lengths as closely as possible; find the average code length.

a:                   a_0   a_1   a_2   a_3   a_4
b (code word):
l_i (i = 0,...,4):

The average code length for this code is very close to the optimal code length, H(A).

Summary
i) The motivation for data compression is to reduce the space allocated for data (an increase of source efficiency). It is obtained by reducing the redundancy which exists in the data.
ii) Compression can be lossless or lossy. In the former case all information is completely restored after decompression, whereas in the latter case it is not (used in applications in which the information loss will not be detected by most users).
iii) The optimal code, which ensures a maximum efficiency for the source, is characterized by code word lengths given by the Shannon information, -log2(p_i).
iv) According to the source coding theorem, the average length of the optimal code is bounded by the entropy as H(A) <= L < H(A) + 1.
v) Coding schemes for data compression include Huffman, Lempel-Ziv and arithmetic coding.

1.3 Huffman Coding
Remarks
Huffman coding is used in data communications, speech coding and video compression. Each symbol is assigned a variable-length code word that depends on its frequency (probability of occurrence): the higher the frequency, the shorter the code word. It is a variable-length code. The number of bits for each code word is an integer (it requires an integer number of coded bits to represent an integer number of source symbols). It is a prefix code (instantaneously decodable).

Encoder: Tree-Building Algorithm
Huffman code words are generated by building a Huffman tree:
Step 1: List the source symbols in a column in descending order of probabilities.

Step 2: Begin with the two lowest-probability symbols. Combining these two symbols forms a new compound symbol, or a branch in the tree. This step is repeated using the two lowest-probability symbols from the new set of symbols, and continues until all the original symbols have been combined into a single compound symbol.
Step 3: A tree is formed, with the top and bottom stems going from each compound symbol to the symbols which form it, labeled with 0 and 1, respectively (or the other way around). Code words are assigned by reading the labels of the tree stems from right to left, back to the original symbol.
Example 1.12: Let the alphabet of the source A be {a_0, a_1, a_2, a_3}, and the probabilities of emitting these symbols be as listed below (p(a_0) = 0.50, ...). Draw the Huffman tree and find the Huffman codes.

STEP 1                STEP 2    STEP 3
Probability   Symbol
0.50          a_0
...           a_1
...           a_2
...           a_3

Symbol    Code word
a_0
a_1
a_2
a_3

[Figure: hardware implementation of encoding and decoding.]
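In software, the tree-building procedure in Steps 1-3 can be sketched with a priority queue. This is an illustrative sketch only (the probabilities below are placeholders rather than the full set of Example 1.12, and tie-breaking choices mean the resulting labels are not unique):

import heapq

def huffman(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> code word."""
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)   # the two lowest-probability entries
        p1, _, codes1 = heapq.heappop(heap)
        for sym in codes0:                    # label one branch 0 ...
            codes0[sym] = "0" + codes0[sym]
        for sym in codes1:                    # ... and the other branch 1
            codes1[sym] = "1" + codes1[sym]
        count += 1                            # merge into a compound symbol
        heapq.heappush(heap, (p0 + p1, count, {**codes0, **codes1}))
    return heap[0][2]

print(huffman({"a0": 0.50, "a1": 0.25, "a2": 0.15, "a3": 0.10}))
# e.g. {'a0': '0', 'a1': '10', 'a2': '111', 'a3': '110'}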


How are the Probabilities Known?
- By counting symbols in the input string: the data must be given in advance, and this requires an extra pass over the input string.
- The data source's distribution is known: the data are not necessarily known in advance, but we know their distribution.
Reasonable care must be taken in estimating the probabilities, since large errors lead to a serious loss in

optimality. For example, a Huffman code designed for English text can have a serious loss in optimality when used for French.

More Remarks
For Huffman coding, the alphabet and its distribution must be known in advance. Huffman coding achieves the entropy when the occurrence probabilities are negative powers of 2 (optimal code). The Huffman code is not unique (because of some arbitrary decisions in the tree construction). Given the Huffman tree, it is easy (and fast) to encode and decode. In general, the efficiency of Huffman coding relies on having a source alphabet A with a fairly large number of symbols. Compound symbols can be formed from the original symbols (see, e.g., A x A); for a compound symbol formed from n symbols, the alphabet is A^n, and the set of probabilities of the compound symbols is denoted by P_{A^n}.
Question: How does one get P_{A^n}?
Answer: Easy for a memoryless source; difficult for a source with memory!

1.4 Lempel-Ziv (LZ) Coding
Remarks
LZ coding does not require knowledge of the symbol probabilities beforehand. It belongs to a particular class of dictionary codes: compression codes that dynamically construct their own coding and decoding tables by looking at the data stream itself. In simple Huffman coding, the dependency between the symbols is ignored, while in LZ these dependencies are identified and exploited to perform better encoding. When all the data characteristics are known (alphabet, probabilities, no dependencies), it is best to use Huffman (LZ will try to find dependencies which are not there). This is the compression algorithm used in most PCs. Because extra information is supplied to the receiver, these codes initially expand the data; the secret is that most of the code words represent strings of source symbols. In a long message it is more economical to encode these strings (which can be of variable length) than it is to encode individual symbols.

Definitions Related to the Structure of the Dictionary
Each entry in the dictionary has an address, m. Each entry is an ordered pair <n, a_i>. The former (n) is a pointer to another location in the dictionary; it is also the transmitted code word. a_i is a symbol drawn from the source alphabet A. A fixed-length binary word of b bits is used to represent the transmitted code word. The number of entries will be lower than or equal to 2^b. The total number of entries will exceed the number of symbols, M, in the source alphabet, so each transmitted code word contains more bits than it would take to represent the alphabet A.
Question: Why do we use LZ coding if the code word has more bits?
Answer: Because most of these code words represent STRINGS of source symbols rather than single symbols.

Encoder
A linked-list algorithm (simplified for illustration purposes) is used; it includes:
Step 1: Initialization. The algorithm is initialized by constructing the first M + 1 entries in the dictionary (the null symbol plus the M source symbols), as follows.

Address (m)    Dictionary entry <n, a_i>
0              <0, null>
1              <0, a_0>
2              <0, a_1>
...            ...
M              <0, a_{M-1}>

Note: The 0-address entry in the dictionary is a null symbol. It is used to let the decoder know where the end of a string is; in a way, this entry is a punctuation mark. The pointers n in these first M + 1 entries are zero, meaning that they initially point to the null entry at

address 0. The initialization also sets the pointer variable to zero (n = 0) and the address pointer to M + 1 (m = M + 1). The address pointer points to the next blank location in the dictionary.
The following steps are then executed iteratively:
Step 2: Fetch the next source symbol a.
Step 3: If the ordered pair <n, a> is already in the dictionary, then
    n = dictionary address of entry <n, a>
else
    transmit n
    create new dictionary entry <n, a> at dictionary address m
    m = m + 1
    n = dictionary address of entry <0, a>
Step 4: Return to Step 2.
Example 1.13: A binary information source emits a sequence of symbols. Construct the encoding dictionary and determine the sequence of transmitted code symbols.
Initialize, then trace the algorithm with a table of the form:

Source    Present    Present    Transmit    Next    Dictionary
symbol    n          m          n           n       entry

Thus, the encoder's dictionary is:

Dictionary address    Dictionary entry
0                     <0, null>
1                     <0, 0>
2                     <0, 1>
...                   ...
7                     <4, 1>

8                     <3, 0>
9                     <6, ...>
...                   ...
14                    (no entry yet)

Decoder
The decoder at the receiver must construct an identical dictionary for decoding. Moreover, the reception of any code word means that a new dictionary entry must be constructed. The pointer n for this new dictionary entry is the same as the received code word. The source symbol a for this entry is not yet known, since it is the root symbol of the next string (which has not yet been transmitted by the encoder). If the address of the next dictionary entry is m, we see that the decoder can only construct a partial entry <n, ?>, since it must await the next received code word to find the root symbol a for this entry. It can, however, fill in the missing symbol a in its previous dictionary entry, at address m - 1. It can also decode the source symbol string associated with the received code word n.
Example 1.14: Decode the received code words transmitted in Example 1.13.

Address (m)    n (pointer)    a_i (symbol)    Decoded bits
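A compact software sketch of the encoder in Steps 1-3 (an illustrative, simplified LZ78-style implementation; the input string below is arbitrary and is not the sequence of Example 1.13):

def lz_encode(symbols, alphabet):
    # Step 1: initialize the dictionary with <0, a> for each source symbol
    # (the <0, null> entry at address 0 is left implicit).
    dictionary = {(0, a): i + 1 for i, a in enumerate(alphabet)}
    m = len(alphabet) + 1          # address of the next blank dictionary location
    n = 0                          # current pointer (0 = null entry)
    transmitted = []
    for a in symbols:              # Step 2: fetch the next source symbol
        if (n, a) in dictionary:   # Step 3: the string continues, follow the pointer
            n = dictionary[(n, a)]
        else:                      # otherwise transmit n and create a new entry
            transmitted.append(n)
            dictionary[(n, a)] = m
            m += 1
            n = dictionary[(0, a)]
    if n != 0:                     # flush the final (possibly incomplete) string
        transmitted.append(n)
    return transmitted, dictionary

codes, dictionary = lz_encode("1010010011", alphabet="01")
print(codes)                       # the sequence of transmitted pointers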

Arithmetic Coding
Remarks
Arithmetic coding assigns one (normally long) code word to the entire input stream. It reads the input stream symbol by symbol, appending more bits to the code word each time. The code word is a number obtained from the symbol probabilities, so the symbols' probabilities need to be known. It encodes symbols using a non-integer number of bits (on average), which results in a very good efficiency of the encoder (it allows the entropy lower bound to be approached). It is often used for data compression in image processing.

Encoder
Construct a code interval (rather than a code number) which uniquely describes a block of successive source symbols. Any convenient number b within this range is a suitable code word, representing the entire block of symbols.
Algorithm: for each a_i in A, assign an interval I_i = [S_li, S_hi). Initialize j = 0, L_0 = 0, H_0 = 1.

REPEAT
    delta = H_j - L_j
    read the next symbol a_{j+1}; if it is a_i, use a_i's interval I_i = [S_li, S_hi) to update:
    H_{j+1} = L_j + delta * S_hi
    L_{j+1} = L_j + delta * S_li
    j = j + 1
UNTIL all symbols have been encoded.
Select a number b that falls in the final interval [L_j, H_j) as the code word.
Example 1.15: For a 4-ary source A = {a_0, a_1, a_2, a_3} with P_A = {0.5, 0.3, 0.15, 0.05}, assign each a_i in A a fraction of the real number interval I = [0, 1) as
a_0: I_0 = [0, 0.5);  a_1: I_1 = [0.5, 0.8);  a_2: I_2 = [0.8, 0.95);  a_3: I_3 = [0.95, 1).
Encode the sequence a_1 a_0 a_0 a_3 a_2 with arithmetic coding.

j    a_i    L_j       H_j      L_{j+1}    H_{j+1}
0    a_1    0         1        0.5        0.8
1    a_0    0.5       0.8      0.5        0.65
2    a_0    0.5       0.65     0.5        0.575
3    a_3    0.5       0.575    0.57125    0.575
4    a_2    0.57125   0.575    0.57425    0.5748125
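The interval-narrowing loop above can be sketched directly in Python (floating point is used here only for illustration; a practical coder works with scaled integer ranges to avoid the precision problems discussed under "Practical Issues" below):

intervals = {                 # a_i -> [S_li, S_hi), as in Example 1.15
    "a0": (0.0, 0.5),
    "a1": (0.5, 0.8),
    "a2": (0.8, 0.95),
    "a3": (0.95, 1.0),
}

def arithmetic_encode(sequence):
    L, H = 0.0, 1.0
    for a in sequence:
        delta = H - L
        S_lo, S_hi = intervals[a]
        H = L + delta * S_hi  # uses the old L, so update H before L
        L = L + delta * S_lo
    return L, H               # any convenient b in [L, H) is a valid code word

print(arithmetic_encode(["a1", "a0", "a0", "a3", "a2"]))
# approximately (0.57425, 0.5748125), matching the table above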

Decoder
In order to decode the message, the symbol order and probabilities must be passed to the decoder. The decoding process mirrors the encoding: given the code word (the final number), at each iteration the corresponding sub-range is identified, decoding the symbol representing that specific range.
Given b, the decoding procedure is:
L = 0, H = 1, delta = H - L
REPEAT
    find i such that (b - L)/delta falls in I_i = [S_li, S_hi)
    output symbol a_i and use a_i's interval I_i to update:
    H = L + delta * S_hi
    L = L + delta * S_li
    delta = H - L
UNTIL the last symbol is decoded.
Example 1.16: For the source and encoder in Example 1.15, decode b (the code word obtained in Example 1.15).

b    L    H    I_i    a_i    Next L    Next H    Next delta

Practical Issues
Attention must be paid to the precision with which we calculate (b - L)/delta: round-off error in this calculation can lead to an erroneous answer. Numerical overflow can also occur (see the products delta * S_li and delta * S_hi). The limited size of S_li and S_hi limits the size of the

alphabet A. In practice it is important to transmit and decode the information on the fly; here, however, we must read in the entire block of source symbols before being able to compute the code word, and we must receive the entire code word b before we can begin decoding.

Comparison of the three coding schemes:

                     Huffman                            Arithmetic                   Lempel-Ziv
Probabilities        Known in advance                   Known in advance             Not known in advance
Alphabet             Known in advance                   Known in advance             Not known in advance
Data loss            None                               None                         None
Symbol dependency    Not used                           Not used                     Used for better compression
Entropy              Achieved if probabilities are      Very close                   Best results for long messages
                     negative powers of 2
Code words           One code word for each symbol      One code word for all data   Code words for strings of source symbols
Intuition            Intuitive                          Not intuitive                Not intuitive

Ch 2 Channel and Channel Capacity

2.1 Discrete Memoryless Channel Model

Communication Link
[Block diagram] Source encoder -> channel encoder -> (coded sequence c_0, c_1, ..., c_t, alphabet C, probabilities P_C) -> modulator -> continuous-input continuous-output channel -> demodulator -> (received sequence y_0, y_1, ..., y_t, alphabet Y, probabilities P_Y) -> channel decoder -> source decoder. The channel encoder/modulator, the physical channel and the demodulator together form a composite discrete-input discrete-output channel.

Definition
In most communication or storage systems, the signal is designed such that the output symbols y_0, y_1, ..., y_t are statistically independent if the input symbols c_0, c_1, ..., c_t are statistically independent. If the output set Y consists of discrete output symbols, and if the property of statistical independence of the output sequence holds, the channel is called a Discrete Memoryless Channel (DMC).

Transition Probability Matrix
Mathematically, we can view the channel as a probabilistic function that transforms a sequence of (usually coded) input symbols, c, into a sequence of channel output symbols, y. Because of noise and other impairments of the communication system, the transformation is not a one-to-one mapping from the set of input symbols, C, to the set of

output symbols, Y. Any particular c from C may have some probability, p(y|c), of being transformed to an output symbol y from Y; this probability is called a (Forward) Transition Probability.
For a DMC, let p(c) be the probability that symbol c is transmitted; the probability that the received symbol is y is then given in terms of the transition probabilities as
q(y) = Sum_{c in C} p(y|c) p(c).
The probability distribution of the output set Y, denoted by Q_Y, may be easily calculated in matrix form as
Q_Y = P_{Y|C} P_C,
where Q_Y = [q(y_0), q(y_1), ..., q(y_{M_Y - 1})]^T is the output distribution, P_C = [p(c_0), p(c_1), ..., p(c_{M_C - 1})]^T is the input distribution, and P_{Y|C} is the M_Y x M_C matrix of transition probabilities with entries [P_{Y|C}]_{j,i} = p(y_j | c_i).
Here:
P_C: probability distribution of the input alphabet
Q_Y: probability distribution of the output alphabet
P_{Y|C}: transition probability matrix of the channel
Remarks: The columns of P_{Y|C} sum to unity (no matter what symbol is sent, some output symbol must result). Numerical values for the transition probability matrix are determined by analysis of the noise and transmission impairment properties of the channel, and of the method of modulation/demodulation.
Hard-Decision Decoding: M_Y = M_C. "Hard" refers to the decision that the demodulator makes: a firm decision on which symbol was transmitted.
Soft-Decision Decoding: M_Y > M_C. The final decision is left to the receiver decoder.
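The matrix relation Q_Y = P_{Y|C} P_C is easy to evaluate numerically. The sketch below uses a made-up 3x2 soft-decision matrix purely for illustration (the actual matrices of Examples 2.1-2.7 are not reproduced in these notes):

import numpy as np

P_YC = np.array([[0.90, 0.05],   # rows: outputs y0, y1, y2; columns: inputs c0, c1
                 [0.05, 0.05],
                 [0.05, 0.90]])
P_C = np.array([0.5, 0.5])       # equally probable input symbols

assert np.allclose(P_YC.sum(axis=0), 1.0)   # each column sums to one
Q_Y = P_YC @ P_C
print(Q_Y)                                  # [0.475 0.05 0.475]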

Example 2.1: C = {0, 1}, with equally probable symbols; Y = {y_0, y_1, y_2}. The transition probability matrix of the channel is P_{Y|C}. Q_Y = ?
Remarks: The sum of the elements in each column of the transition probability matrix is 1. This is an example of soft-decision decoding.
Example 2.1 (cont'd): Calculate the entropy of Y for the previous system. Compare it with the entropy of the source C. (How can this happen?)
Remarks: We noticed the same thing when we discussed the source encoder (encryption encoder). It is possible for the output entropy to be greater than the input entropy, but the additional information carried in the output is not related to the information from the source: the extra information in the output comes from the presence of noise in the channel during transmission, and not from the source C. This extra information carried in Y is truly useless; in fact, it is harmful, because it produces uncertainty about what symbols were transmitted.
Question: Can we solve this problem by using only systems which employ hard-decision decoding?

Answer:
Example 2.2: C = {0, 1}, with equally probable symbols; Y = {0, 1}. The transition probability matrix of the channel is P_{Y|C}. Calculate the entropy of Y. Compare it with the entropy of the source C.
Remarks: Y carries less information than was transmitted by the source.
Question: Where did it go?
Answer: It was lost during the transmission process. The channel is information-lossy!
So far we have looked at two examples, in which the output entropy was either greater than or less than the input entropy. What we have not considered yet is what effect all this has on the ability to tell, from observing Y, what original information was transmitted. Do not forget that the purpose of the receiver is to recover the original transmitted information! What does the observation of Y tell us about the transmitted information sequence?
As we know, mutual information is a measure of how much the uncertainty about a random variable c is reduced by observing a random variable y. If Y tells us nothing about C (e.g., Y and C are independent, such as when somebody has cut the phone wire and no signal is getting through), then I(C;Y) = 0. But if I(C;Y) = H(C), then

looking at Y leaves no uncertainty about C, i.e., Y contains sufficient information to tell what the transmitted sequence is.
The conditional entropy H(C|Y) is a measure of how much information loss occurs in the channel!
Example 2.3: Calculate the mutual information for the system of Example 2.1.
Remark: The mutual information for this system is well below the entropy (H(C) = 1) of the source, and so this channel has a high level of information loss.
Example 2.4: Calculate the mutual information for the system of Example 2.2.
Remarks: This channel is quite lossy also. Although H(Y) was almost equal to H(C) in Example 2.2, the mutual information is considerably less than H(C). One cannot tell how much information loss we are dealing with simply by comparing the input and output entropies!

2.2 Channel Capacity and Binary Symmetric Channel

Maximization of Mutual Information and Channel Capacity
Each time the transmitter sends a symbol, it is said to use the channel. The Channel Capacity is the maximum average amount of information that can be sent per channel use.
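For reference, the mutual information of a DMC follows directly from the transition matrix and the input distribution via I(C;Y) = H(Y) - H(Y|C). A sketch (reusing the made-up 3x2 matrix from the earlier sketch, not the actual matrices of Examples 2.3 and 2.4):

import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(P_YC, P_C):
    Q_Y = P_YC @ P_C                              # output distribution
    H_Y = entropy(Q_Y)
    H_Y_given_C = sum(pc * entropy(P_YC[:, i])    # sum_c p(c) H(Y | C = c)
                      for i, pc in enumerate(P_C))
    return H_Y - H_Y_given_C

P_YC = np.array([[0.90, 0.05], [0.05, 0.05], [0.05, 0.90]])
P_C = np.array([0.5, 0.5])
print(mutual_information(P_YC, P_C))              # about 0.67 bits per channel use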

Question: Why is it not the same as the mutual information?
Answer: Because, for a fixed transition probability matrix, a change in the probability distribution of C, P_C, results in a different mutual information I(C;Y). The maximum mutual information achieved for a given transition probability matrix is the Channel Capacity:
C_C = max_{P_C} I(C;Y),
with units of bits per channel use.
An analytical closed-form solution for C_C is difficult to obtain for an arbitrary channel. An efficient numerical algorithm for finding C_C was derived in 1972 by Blahut and Arimoto (see the textbook).
Example 2.5: For each of the following transition probability matrices, find the channel capacity C_C, the input and output probability distributions (P_C and Q_Y) that achieve the channel capacity, and the mutual information given a uniform P_C.
a) P_{Y|C} = ..., C_C = ?, P_C = ?, Q_Y = ?
b) P_{Y|C} = ..., C_C = ?, P_C = ?, Q_Y = ?
c) P_{Y|C} = ..., C_C = ?, P_C = ?, Q_Y = ?
d) P_{Y|C} = ..., C_C = ?, P_C = ?, Q_Y = ?
e) P_{Y|C} = ..., C_C = ?, P_C = ?, Q_Y = ?
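Example 2.5 is intended to be solved numerically. The following is a hedged sketch of the Blahut-Arimoto iteration mentioned above (the textbook's exact formulation may differ in details such as the stopping rule); the 2x2 matrix at the bottom is a BSC used only as a sanity check, since its capacity should equal 1 - H(p):

import numpy as np

def blahut_arimoto(P_YC, tol=1e-9, max_iter=10_000):
    """P_YC[j, i] = p(y_j | c_i). Returns (capacity in bits, capacity-achieving P_C)."""
    n_out, n_in = P_YC.shape
    p = np.full(n_in, 1.0 / n_in)        # start from a uniform input distribution
    cap, prev = 0.0, -1.0
    for _ in range(max_iter):
        q = P_YC @ p                     # current output distribution
        # D[i] = sum_y p(y|c_i) log2( p(y|c_i) / q(y) )
        ratio = np.where(P_YC > 0, P_YC / np.where(q > 0, q, 1.0)[:, None], 1.0)
        D = (P_YC * np.log2(ratio)).sum(axis=0)
        cap = float(p @ D)               # mutual information for the current p
        if abs(cap - prev) < tol:
            break
        prev = cap
        p = p * np.exp2(D)               # Blahut-Arimoto re-weighting step
        p /= p.sum()
    return cap, p

P_YC = np.array([[0.99, 0.01], [0.01, 0.99]])   # BSC with crossover p = 0.01
print(blahut_arimoto(P_YC))                     # about 0.919 bits/use, uniform P_C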

Remarks: The channel capacity proves to be a sensitive function of the transition probability matrix P_{Y|C}, but a fairly weak function of P_C. The last case is interesting, as the uniform input distribution produces the maximum mutual information: it is an example of a Symmetric Channel. Note that the columns of a symmetric channel's transition probability matrix are permutations of each other; likewise, the top and bottom rows are permutations of each other. The center row, which is not a permutation of the other rows, corresponds to the output symbol y_1, which, as we noticed in Example 2.3, makes no contribution to the mutual information.

Symmetric Channels
Symmetric channels play an important role in communication systems, and many such systems attempt, by design, to achieve a symmetric channel function. The reason for the importance of the symmetric channel is that, when such a channel is possible, it frequently has greater channel capacity than a non-symmetric channel would have.
Example 2.6: P_{Y|C} = ..., C_C = ?, P_C = ?, Q_Y = ?
The transition probability matrix is slightly changed compared to Example 2.5e), and the channel capacity decreases.
Example 2.7: P_C = {0.25, 0.25, 0.25, 0.25}, P_{Y|C} = ..., C_C = ?
This is an example of using quadrature phase-shift keying (QPSK), which is a modulation method that produces a symmetric channel. For QPSK, M_C = M_Y = 4.

Remarks:
i) The capacity for this channel is achieved when P_C is uniformly distributed. This is always the case for a symmetric channel.
ii) The columns of the transition probability matrix are permutations of each other, and so are the rows.
iii) When the transition probability matrix is a square matrix, this permutation property of columns and rows is a sufficient condition for a uniformly distributed input alphabet to achieve the maximum mutual information. Indeed, the permutation condition is what gives rise to the term symmetric channel.

Binary Symmetric Channel (BSC)
A symmetric channel of considerable importance, both theoretically and practically, is the binary symmetric channel (BSC), for which
P_{Y|C} = [ 1-p    p  ]
          [  p    1-p ].
The parameter p is known as the Crossover Probability, and it is the probability that the demodulator/detector makes a hard-decision decoding error. The BSC is the model for essentially all binary-pulse transmission systems of practical importance.
Channel Capacity: for a uniform input probability distribution,
C_C = 1 + p log2(p) + (1 - p) log2(1 - p),
which is often written as
C_C = 1 - H(p),
where the notation H(p) arises from the terms involving p (the binary entropy function H(p) = -p log2(p) - (1 - p) log2(1 - p)).
Remarks: The capacity is bounded by the range 0 <= C_C <= 1. The upper bound is achieved only if p = 0 or p = 1. The case p = 0 is not surprising, as it corresponds to a channel which does not make errors (known as a noiseless channel).

The case p = 1 corresponds to a channel which always makes errors: if we know that the channel output is always wrong, we can easily set things right by decoding the opposite of what the channel output is. The case p = 0.5 corresponds to a channel for which the output symbol is as likely to be correct as it is to be incorrect; under this condition the information loss in the channel is total, and the channel capacity is zero. The capacity of the BSC is a concave-upward function of p, possessing a single minimum at p = 0.5. Except for the p = 0 and p = 1 cases, the capacity of the BSC is always less than the source entropy. If we try to transmit information through the channel using the maximum amount of information per symbol, some of this information will be lost, and decoding errors will result at the receiver. However, if we add sufficient redundancy to the transmitted data stream, it is possible to reduce the probability of lost information to an arbitrarily low level.

2.3 Block Coding and Shannon's 2nd Theorem

Equivocation
We have seen that there is a maximum amount of information per channel use that can be supported by the channel. Any attempt to exceed this channel capacity will result in information being lost during transmission. That is, I(C;Y) <= C_C and I(C;Y) = H(C) - H(C|Y), so
H(C|Y) >= H(C) - C_C.
The conditional entropy H(C|Y) corresponds to our uncertainty about what the input of the channel was, given our observation of the channel output. It is a measure of the information loss during the transmission; for this reason, this conditional entropy is often called the Equivocation. The equivocation has the property that 0 <= H(C|Y) <= H(C), and it is given by
H(C|Y) = -Sum_{c,y} p(c, y) log2 p(c|y).

The equivocation is zero if and only if the transition probabilities p(y|c) are either zero or one for all pairs (y in Y, c in C).

Entropy Rate
The entropy of a block of n symbols satisfies the inequality
H(C_0, C_1, ..., C_{n-1}) <= n H(C),
with equality if and only if C is a memoryless source. In transmitting a block of n symbols, we use the channel n times. Recall that channel capacity has units of bits per channel use and refers to an average amount of information per channel use. Since H(C_0, C_1, ..., C_{n-1}) is the average information contained in the n-symbol block, it follows that the average information per channel use is H(C_0, C_1, ..., C_{n-1}) / n. The average bits per channel use is obtained in the limit as n goes to infinity:
R = lim_{n -> infinity} H(C_0, C_1, ..., C_{n-1}) / n <= H(C),
where R is called the Entropy Rate, with equality if and only if all symbols are statistically independent.
Suppose that they are not, and that in the transmission of the block we deliberately introduce redundant symbols. Then R < H(C). Taking this further, suppose that we introduce a sufficient number of redundant symbols in the block so that R < C_C.
Question: Is transmission without information loss (i.e., zero equivocation) possible in such a case?
Answer: Remarkably enough, the answer to this question is YES!
What is the implication of doing so? It is possible to send information through the channel with an arbitrarily low probability of error. The process of adding redundancy to a block of transmitted symbols is called Channel Coding.

Question: Does there exist a channel code that will accomplish this purpose?
Answer: The answer to this question is given by Shannon's second theorem.

Shannon's 2nd Theorem
Suppose R < C_C, where C_C is the capacity of a memoryless channel. Then, for any epsilon > 0, there exists a block code of length n and rate R whose probability of block decoding error p_e satisfies p_e <= epsilon when the code is used on this channel.
Shannon's second theorem (also called Shannon's main theorem) tells us that it is possible to transmit information over a noisy channel with an arbitrarily small probability of error: if the entropy rate R in a block of n symbols is smaller than the channel capacity, then we can make the probability of error arbitrarily small.
What error are we speaking about? Suppose we send a block of n bits in which k < n of these bits are statistically independent information bits and n - k are redundant parity bits computed from the k information bits according to some coding rule. The entropy of the block will then be k bits, and the average information in bits per channel use will be R = k/n. If this entropy rate is less than the channel capacity, Shannon's main theorem says we can make the probability of error in recovering our original k information bits arbitrarily small. The channel will make errors within our block of n bits, but the redundancy built into the block will be sufficient to correct these errors and recover the k bits of information we transmitted.
Shannon's theorem does not say that we can do this for just any block length n we might want to choose! The theorem says there exists a block length n for which there is a code of rate R. The required block length n depends on the upper bound we pick for our error probability. Actually, Shannon's theorem implies very strongly that the block length n is going to be very large if R is to approach C_C to within an arbitrarily small distance with an arbitrarily small probability of error.
The complexity and expense of an error-correcting channel code are believed to grow rapidly as R approaches the channel capacity and the probability of a block decoding

error is made arbitrarily small. It is believed by many that beyond a particular rate, called the Cutoff Rate, R_0, it is prohibitively expensive to use the channel. In the case of the binary symmetric channel, this rate is given by
R_0 = -log2( 0.5 + sqrt(p(1 - p)) ).
The belief that R_0 is some kind of "sound barrier" for practical error-correcting codes comes from the fact that, for certain kinds of decoding methods, the complexity of the decoder grows extremely rapidly as R exceeds R_0.

2.4 Markov Processes and Sources with Memory

Markov Process
Thus far, we have discussed memoryless sources and channels. We now turn our attention to sources with memory. By this we mean information sources where the successive symbols in a transmitted sequence are correlated with each other, i.e., the sources in a sense remember what symbols they have previously emitted, and the probability of the next symbol depends on this history.
Sources with memory arise in a number of ways. First, natural languages, such as English, have this property. For example, the letter q in English is almost always followed by the letter u. Similarly, the letter t is followed by the letter h approximately 37% of the time in English text. Many real-time signals, such as speech waveforms, are also heavily time-correlated; any time-correlated signal is a source with memory. Finally, we sometimes wish to deliberately introduce some correlation (redundancy) in a source for purposes of block coding, as discussed in the previous section.
Let A be the alphabet of a discrete source having M_A symbols, and suppose this source emits a time sequence of symbols (s_0, s_1, ..., s_t, ...) with each s_t in A. If the conditional probability p(s_t | s_{t-1}, ..., s_0) depends only on the j previous symbols, so that p(s_t | s_{t-1}, ..., s_0) = p(s_t | s_{t-1}, ..., s_{t-j}), then A is called a j-th order Markov process. The string of j symbols (s_{t-1}, ..., s_{t-j}) is called the state of the Markov process at time t. A j-th order Markov process therefore has N = M_A^j possible states.

Let us number these possible states from 0 to N - 1 and let pi_n(t) represent the probability of being in state n at time t. The probability distribution of the system at time t can then be represented by the vector
Pi_t = [pi_0(t), pi_1(t), ..., pi_{N-1}(t)]^T.
For each state at time t, there are M_A possible next states at time t + 1, depending on which symbol is emitted next by the source. If we let p_{i|k} be the conditional probability of going to state i given that the present state is k, the state probability distribution at time t + 1 is governed by the transition probability matrix
P_A = [ p_{0|0}      p_{0|1}      ...   p_{0|N-1}
        p_{1|0}      p_{1|1}      ...   p_{1|N-1}
        ...
        p_{N-1|0}    p_{N-1|1}    ...   p_{N-1|N-1} ]
and is given by Pi_{t+1} = P_A Pi_t.
Example 2.8: Let A be a binary first-order Markov source with A = {0, 1}. This source has 2 states, labeled 0 and 1. Let the transition probabilities be given by the matrix P_A. What is the equation for the next probability state? Find the state probabilities at time t = 2, given that the probabilities at time t = 0 are pi_0 = 1 and pi_1 = 0.
The next-state equation for the state probabilities is Pi_{t+1} = P_A Pi_t.
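The next-state equation is straightforward to iterate numerically. A small sketch (the 2x2 matrix below is a made-up first-order example, since the matrix of Example 2.8 is not reproduced in these notes):

import numpy as np

P_A = np.array([[0.9, 0.4],    # column k holds the probabilities of leaving state k
                [0.1, 0.6]])
assert np.allclose(P_A.sum(axis=0), 1.0)

pi = np.array([1.0, 0.0])      # Pi_0: start in state 0 with certainty
for t in range(1, 3):
    pi = P_A @ pi              # Pi_{t+1} = P_A Pi_t
    print(t, pi)               # Pi_1, Pi_2 -- gradually approaching the steady state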

Example 2.9: Let A be a second-order binary Markov source with
Pr(a = 0 | 0,0) = 0.2    Pr(a = 1 | 0,0) = 0.8
Pr(a = 0 | 0,1) = 0.4    Pr(a = 1 | 0,1) = 0.6
Pr(a = 0 | 1,0) = 0.0    Pr(a = 1 | 1,0) = 1.0
Pr(a = 0 | 1,1) = 0.5    Pr(a = 1 | 1,1) = 0.5
If all the states are equally probable at time t = 0, what are the state probabilities at t = 1?
Define the states as the pairs of previously emitted symbols: S_0 = (0,0), S_1 = (0,1), S_2 = (1,0), S_3 = (1,1). The possible state transitions and their associated transition probabilities can be represented using a state diagram. [State diagram for this problem, with branches labeled by the emitted symbol and the transition probabilities listed above.]
The next-state probability equation is Pi_{t+1} = P_A Pi_t.
Remarks: Every column of the transition probability matrix adds to one. Every properly constructed transition probability matrix has this property.

Steady-State Probability and the Entropy Rate
Starting from the equation for the state probabilities, it can be shown by induction that the state probabilities at time t are given by Pi_t = (P_A)^t Pi_0.
A Markov process is said to be Ergodic if we can get from the initial state to any other state in some number of steps and if, for large t, Pi_t approaches a steady-state value that is independent of the initial probability distribution Pi_0. The steady-state value is reached when Pi = P_A Pi.
The Markov processes which model information sources are always ergodic.
Example 2.10: Find the steady-state probability distribution for the source in Example 2.9.
In the steady state, the state probabilities satisfy
pi_0 = 0.2 pi_0 + 0.4 pi_1
pi_1 = 0.5 pi_3
pi_2 = 0.8 pi_0 + 0.6 pi_1
pi_3 = 1.0 pi_2 + 0.5 pi_3
It appears from this that we have four equations and four unknowns, so solving for the four probabilities should be no problem. However, if we look closely, we will see that only three of the equations above are linearly independent. To solve for the probabilities, we can use any three of the above equations together with the constraint equation
pi_0 + pi_1 + pi_2 + pi_3 = 1.
This equation is a consequence of the fact that the total probability must sum to unity; it is certain that the system is in some state! Dropping the first equation above and using the constraint, we have
pi_1 = 0.5 pi_3
pi_2 = 0.8 pi_0 + 0.6 pi_1
pi_3 = 1.0 pi_2 + 0.5 pi_3
pi_0 + pi_1 + pi_2 + pi_3 = 1,

which has the solution pi_0 = 1/9, pi_1 = 2/9, pi_2 = 2/9, pi_3 = 4/9. This solution is independent of the initial probability distribution.
The situation illustrated in the previous example, where only N - 1 of the equations resulting from the transition probability expression are linearly independent and we must use the sum-to-unity equation to obtain the solution, always occurs in the steady-state probability solution of an ergodic Markov process.

Entropy Rate of an Ergodic Markov Process
POP QUIZ: How do you define the entropy rate?
The entropy rate, R, is the average information per channel use (average information bits per channel use),
R = lim_{t -> infinity} H(A_0, A_1, ..., A_{t-1}) / t <= H(A),
with equality if and only if all symbols are statistically independent.
For ergodic Markov sources, as t grows very large, the state probabilities converge to a steady-state value pi_n for each of the N possible states (n = 0, ..., N-1). As t becomes large, the average information per symbol in the block of symbols is determined by the probabilities of occurrence of the symbols in A after the state probabilities converge to their steady-state values.
Suppose we are in state S_n at time t. The conditional entropy of A is
H(A|S_n) = -Sum_m Pr(a_m|S_n) log2 Pr(a_m|S_n).
Since each possible symbol a leads to a single state, S_n can lead to M_A possible next states. The remaining N - M_A states cannot be reached from S_n, and for these states the transition probability p_{i|n} = 0. Therefore, the conditional entropy can be expressed in terms of the transition probabilities as
H(A|S_n) = -Sum_i p_{i|n} log2 p_{i|n}.
For large t, the probability of being in state S_n is given by its steady-state probability pi_n. Therefore, the entropy rate of the system is
R = Sum_{n=0}^{N-1} pi_n H(A|S_n).

This expression, in turn, is equivalent to
R = -Sum_{n=0}^{N-1} pi_n Sum_i p_{i|n} log2 p_{i|n},
where the p_{i|n} are the entries of the transition probability matrix and the pi_n are the steady-state probabilities.
Example 2.11: Find the entropy rate for the source in Example 2.9. Calculate the steady-state probability of the source emitting a 0 and the steady-state probability of the source emitting a 1. Calculate the entropy of a memoryless source having these symbol probabilities and compare the result with the entropy rate of the Markov source.
With the steady-state probabilities calculated in Example 2.10, applying the formula for the entropy rate of an ergodic Markov source gives R of about 0.74 bits/symbol.
The steady-state probabilities of emitting 0 and 1 are, respectively, Pr(0) = Sum_n pi_n Pr(0|S_n) = 1/3 and Pr(1) = 2/3.
The entropy of a memoryless source having this symbol distribution is H(X) = H(1/3, 2/3), about 0.918 bits/symbol.
Thus, R < H(X), as expected.
Remarks:
i) In an earlier section, we discussed how introducing redundancy into a block of symbols can be used to reduce the entropy rate to a level below the channel capacity and

how this technique can be used for error correction at the receive side, in order to achieve an arbitrarily small information bit error rate.
ii) In this section, we have seen that a Markov process also introduces redundancy into the symbol block.
Question: Can this redundancy be introduced in such a way as to be useful for error correction?
Answer: YES! This is the principle underlying a class of error-correcting codes known as convolutional codes.
iii) In the previous lecture we examined the process of transmitting information C through a channel, which produces a channel output Y. We found that a noisy channel introduces information loss if the entropy rate exceeds the channel capacity.
iv) It is natural to wonder if there might be some (possibly complicated) form of data processing which can be performed on Y to recover the lost information. Unfortunately, the answer to this question is NO! Once the information has been lost, it is gone!

Data Processing Inequality
This states that additional processing of the channel output can at best result in no further loss of information, and may even result in additional information loss: if Y is processed to produce Z (Y -> data processing -> Z), then I(C;Z) <= I(C;Y).
A very common example of this kind of information loss is the round-off or truncation error that occurs during digital signal processing in a computer or microprocessor. Another example is quantization in an analog-to-digital converter. Designers of these systems need to be aware of the possible impact of such design decisions, such as the word length of the digital signal processor or the number of bits of quantization in the analog-to-digital converter, on the information content.
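Returning to Examples 2.10 and 2.11, the steady-state distribution and the entropy rate of the Markov source can be checked numerically. The sketch below rebuilds the transition matrix from the conditional probabilities of Example 2.9, using the state labeling S_0 = (0,0), S_1 = (0,1), S_2 = (1,0), S_3 = (1,1) adopted above:

import numpy as np

# Pr(next symbol = 0 | state), Pr(next symbol = 1 | state) for S0..S3
emit = np.array([[0.2, 0.8],
                 [0.4, 0.6],
                 [0.0, 1.0],
                 [0.5, 0.5]])

# State transition matrix P_A: column n lists where state S_n can go.
P_A = np.zeros((4, 4))
for n in range(4):
    for sym in (0, 1):
        nxt = 2 * sym + (n >> 1)        # new state = (new symbol, most recent symbol)
        P_A[nxt, n] += emit[n, sym]

# Steady state: the eigenvector of P_A for eigenvalue 1, normalized to sum to 1.
w, v = np.linalg.eig(P_A)
pi = np.real(v[:, np.argmin(abs(w - 1))])
pi /= pi.sum()
print(pi)                               # approximately [1/9, 2/9, 2/9, 4/9]

H_cond = [-sum(p * np.log2(p) for p in row if p > 0) for row in emit]
R = float(pi @ np.array(H_cond))
print(R)                                # entropy rate, about 0.74 bits/symbol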

2.5 Constrained Channels

Channel Constraints
So far, we have considered only memoryless channels corrupted by noise, which are modeled as discrete-input discrete-output memoryless channels. However, in many cases we have channels which place constraints on the information sequence.
[Block diagram of a typical communication system] coded sequence a_t -> modulator -> bandlimited channel -> (+ noise) -> demodulator -> sampler -> symbol detector -> y_t, with a timing recovery block driving the sampler.
The coded information a_t is presented to the modulator, which transforms the symbol sequence into continuous-valued waveform signals, s(t), designed to be compatible with the physical channel (bandlimited channel). Examples of bandlimited channels are wireless channels, telephone lines, TV cables, etc. During transmission, the information-bearing signal is distorted by the channel and corrupted with noise. The output of the demodulator, which attempts to combat the distortion and minimize the effect of the noise, is sampled, and the detector attempts to reconstruct the original coded sequence a_t. Timing recovery is required, and the performance of this block is crucial in recovering the information. The theory and practice of performing these tasks constitute modulation theory, which is treated in digital communications textbooks. In this course, we are concerned with the information-theory aspects of this process. What are these aspects?
Remarks:
i) When the system needs to recover the timing information, additional information must be transmitted for that purpose. As the maximum information rate is limited by the

channel capacity, the information needed for timing recovery is included at the expense of user information. This may require that the sequence of transmitted symbols be constrained in such a way as to guarantee the presence of timing information embedded within the transmitted coded sequence.
ii) Another aspect arises from the type and severity of the channel distortion imposed by the physical bandlimited channel. We can think of the physical channel as performing a kind of data processing on the information-bearing waveform presented to it by the modulator; but data processing might result in information loss. A given channel can thus place its own constraints on the allowable symbol sequences which can be processed without information loss.
iii) Modulation theory tells us that it is possible and desirable to model the communication channel as a cascade of a noise-free constrained channel and an unconstrained noisy channel (we have implicitly used such a model, except that we have not considered any constraint on the input symbol sequence).
[Block diagram] a_t -> constrained channel, h_t -> x_t -> (+ noise n_t) -> r_t -> decision block -> y_t.

Linear and Time-Invariant (LTI) Channel
The LTI channel is specified by a set of parameters h_t, which represent the channel impulse response. The channel's output sequence is related to the input sequence by the convolution
x_t = Sum_k h_k a_{t-k}.
The decision block is presented with a noisy signal r_t = x_t + n_t. The decision block takes these inputs and produces output symbols y_t, drawn from a finite alphabet Y, with M_Y >= M_A.

If M_Y = M_A, y_t is an estimate of the transmitted symbol a_t, and the decision block is said to make a Hard decision. If M_Y > M_A, the decision block is said to make a Soft decision, and the final decision on the transmitted symbol a_t is made by the decoder.

Example 2.12: Let A be a source with equiprobable symbols, A = {-1, 1}. The bandlimited channel has the impulse response {h_0 = 1, h_1 = 0, h_2 = -1}. Calculate the steady-state entropy of the constrained channel's output and the entropy rate of the sequence x_t.

State of the channel at time t: S_t = <a_{t-1}, a_{t-2}>. The states are as follows: (-1,-1) is state S_0, (1,-1) is state S_1, (-1,1) is state S_2, (1,1) is state S_3. The channel can be represented as a Markov process, with the state diagram given in the sequel.

(State diagram with the four states S_0, S_1, S_2, S_3; each arrow is labeled a_t / x_t, and all transition probabilities, shown in parentheses, are 0.5.)

One can easily show that X = {-2, 0, 2}. The steady-state probability equation is then pi = pi P, where P is the state transition matrix,

from which we set up four equations and find the steady-state probabilities, i.e., pi_i = 0.25, i = 0, 1, 2, 3.

The output symbol probabilities are P(X = 0) = 0.5, P(X = 2) = 0.25, P(X = -2) = 0.25. The steady-state entropy of the channel output is therefore H(X) = 1.5 bits/symbol. The entropy rate is 1 bit/symbol, which equals the source entropy: the channel is lossless. Note that the entropy rate is not equal to the steady-state entropy of the channel's output symbols.

While the channel is lossless, the sequences it produces do not carry sufficient information to permit clock recovery for arbitrary input sequences. For example, a long input run of -1 or of +1, or a long sequence of alternating symbols +1 -1 or -1 +1, all produce a long run of zeros at the output of the channel. Timing recovery methods can fail in such situations.
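The numbers above can be checked mechanically. The following Python sketch (an illustration, not part of the original notes) builds the 4-state transition matrix of Example 2.12, solves for the stationary distribution, and computes both the output-symbol entropy H(X) and the entropy rate.

import numpy as np

# States S0=(-1,-1), S1=(1,-1), S2=(-1,1), S3=(1,1), with S_t = <a_{t-1}, a_{t-2}>.
# Input a_t moves (a_{t-1}, a_{t-2}) to (a_t, a_{t-1}); the output is x_t = a_t - a_{t-2}.
states = [(-1, -1), (1, -1), (-1, 1), (1, 1)]
P = np.zeros((4, 4))                       # transition probabilities
out = {}                                   # (state index, a_t) -> x_t
for i, (a1, a2) in enumerate(states):
    for a in (-1, 1):                      # equiprobable inputs
        j = states.index((a, a1))
        P[i, j] += 0.5
        out[(i, a)] = a - a2

# Stationary distribution: solve pi = pi P together with sum(pi) = 1
A = np.vstack([P.T - np.eye(4), np.ones(4)])
b = np.array([0, 0, 0, 0, 1.0])
pi = np.linalg.lstsq(A, b, rcond=None)[0]      # -> [0.25, 0.25, 0.25, 0.25]

# Output-symbol probabilities and the steady-state entropy H(X)
px = {}
for (i, a), x in out.items():
    px[x] = px.get(x, 0.0) + pi[i] * 0.5
H_X = -sum(p * np.log2(p) for p in px.values())            # 1.5 bits/symbol

# Entropy rate of the Markov process: sum_i pi_i * H(transitions out of state i)
H_rate = -sum(pi[i] * P[i, j] * np.log2(P[i, j])
              for i in range(4) for j in range(4) if P[i, j] > 0)   # 1.0 bit/symbol

print(pi, px, H_X, H_rate)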

Ch 3 Error Control Strategies

Error Control Strategies: Forward Error Correction (FEC) and Automatic Repeat Request (ARQ).

Forward Error Correction (FEC)
In a one-way communication system, the transmission or recording is strictly in one direction, from transmitter to receiver. The error control strategy must be FEC; that is, such systems employ error-correcting codes that automatically correct errors detected at the receiver. Examples: 1) digital storage systems, in which the recorded information may be replayed weeks or even months after it is recorded, and 2) deep-space communication systems. Most coded systems in use today employ some form of FEC, even if the channel is not strictly one-way!

For a two-way system, however, the error control strategy can use error detection and retransmission, which is called automatic repeat request (ARQ).

3.1 Automatic Repeat Request

Automatic Repeat Request (ARQ)
In most communication systems the information can be sent in both directions, and the transmitter also acts as a receiver (transceiver), and vice versa. Examples: data networks, satellite communications, etc. Error control strategies for a two-way system can include error detection and retransmission, called Automatic Repeat Request (ARQ). In an ARQ system, when errors are detected at the receiver, a request is sent for the transmitter to repeat the message, and repeat requests continue to be sent until the message is correctly received.

ARQ systems: Stop-and-Wait ARQ, and Continuous ARQ (Go-Back-N ARQ or Selective ARQ).

Types
Stop-and-Wait (SW) ARQ: The transmitter sends a block of information to the receiver and waits for a positive (ACK) or negative (NAK) acknowledgment from the receiver. If an ACK is received (no error detected), the transmitter sends the next block. If a NAK is received (errors detected), the transmitter resends the previous block. When the errors are persistent, the same block may be retransmitted several times before it is correctly received and acknowledged.

Continuous ARQ: The transmitter sends blocks of information to the receiver continuously and receives acknowledgments continuously. When a NAK is received, the transmitter begins a retransmission. It may back up to the erroneous block and resend that block plus the N-1 blocks that follow it; this is called Go-Back-N (GBN) ARQ. Alternatively, the transmitter may simply resend only those blocks that are negatively acknowledged; this is known as Selective Repeat (SR) ARQ.

Comparison
GBN versus SR ARQ: SR ARQ is more efficient than GBN ARQ, but requires more logic and buffering.
Continuous versus SW ARQ: Continuous ARQ is more efficient than SW ARQ, but it is more expensive to implement. For example, in satellite communication, where the transmission rate is high and the round-trip delay is long, continuous ARQ is used. SW ARQ is used in systems where the time taken to transmit a block is long compared to the time taken to receive an acknowledgment. SW ARQ is used on half-duplex channels (only one-way transmission at a time), whereas continuous ARQ is designed for use on full-duplex channels (simultaneous two-way transmission).

Performance Measures
Throughput Efficiency: the ratio of the average number of information bits successfully accepted by the receiver per unit of time to the total number of information digits that could have been transmitted per unit of time.

Delay of a Scheme: the interval from the beginning of the transmission of a block to the receipt of a positive acknowledgment for that block.

(Figure 1, from Lin and Costello, Error Control Coding: GBN versus SR ARQ.)

ARQ versus FEC
The major advantage of ARQ over FEC is that error detection requires much simpler decoding equipment than error correction. Also, ARQ is adaptive in the sense that information is retransmitted only when errors occur. In contrast, when the channel error rate is high, retransmissions must be sent too frequently, and the SYSTEM THROUGHPUT is lowered by ARQ. In this situation, a HYBRID combination of FEC for the most frequent error patterns along with error detection and retransmission for the less likely error patterns is more efficient than ARQ alone (HYBRID ARQ).
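As a rough illustration of the throughput comparison above, the Python sketch below evaluates commonly quoted first-order throughput models for the three ARQ schemes. The closed-form expressions and the numbers plugged in (block error probability P, round-trip delay D and window N in block durations) are illustrative assumptions, not values taken from these notes.

def throughput_sw(R, P, D):
    # Stop-and-wait: one block sent, then roughly D block-times idle waiting for the ACK
    return R * (1.0 - P) / (1.0 + D)

def throughput_gbn(R, P, N):
    # Go-back-N: each detected block error costs about N block retransmissions
    return R * (1.0 - P) / (1.0 - P + N * P)

def throughput_sr(R, P):
    # Selective repeat: only the erroneous blocks are resent
    return R * (1.0 - P)

R, P = 12.0 / 23.0, 1e-2     # code rate and block-error probability (assumed)
N, D = 8, 8                  # round-trip delay / window in block durations (assumed)
print(throughput_sw(R, P, D), throughput_gbn(R, P, N), throughput_sr(R, P))

With these numbers the ordering SR > GBN > SW appears directly, which is the comparison stated above.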

3.2 Forward Error Correction

Performance Measures: Error Probability
The performance of a coded communication system is in general measured by its probability of decoding error (called the Error Probability) and by its coding gain over an uncoded system that transmits information at the same rate (with the same modulation format). There are two types of error probabilities: the probability of word (or block) error and the probability of bit error. The probability of block error is defined as the probability that a decoded word (or block) at the output of the decoder is in error. This error probability is often called the Word-Error Rate (WER) or Block-Error Rate (BLER). The probability of bit error, also called the Bit-Error Rate (BER), is defined as the probability that a decoded information bit at the output of the decoder is in error.

A coded communication system should be designed to keep these two error probabilities as low as possible under certain system constraints, such as power, bandwidth and decoding complexity. The error probability of a coded communication system is commonly expressed in terms of the ratio of the energy per information bit, E_b, to the one-sided power spectral density (PSD) N_0 of the channel noise.

Example 3.1: Consider a coded communication system using a (23, 12) binary Golay code for error control. Each code word consists of 23 code digits, of which 12 are information digits. Therefore, there are 11 redundant bits, and the code rate is R = 12/23 = 0.52. Suppose that BPSK modulation with coherent detection is used and the channel is AWGN with one-sided PSD N_0. Let E_b/N_0 at the input of the receiver be the signal-to-noise ratio (SNR), which is usually expressed in dB.

The bit-error performance of the (23, 12) Golay code with both hard- and soft-decision decoding versus SNR is given below, along with the performance of the uncoded system.

(Figure 2, from Lin and Costello, Error Control Coding: BER versus E_b/N_0 for the (23, 12) Golay code with hard- and soft-decision decoding, and for uncoded BPSK.)

From the above figure, the coded system, with either hard- or soft-decision decoding, provides a lower bit-error probability than the uncoded system for the same SNR, when the SNR is above a certain threshold. With hard decisions, this threshold is 3.7 dB. For SNR = 7 dB, the BER of the uncoded system is 8x10^-4, whereas the coded system (hard decision) achieves a BER of 2.9x10^-5. This is a significant improvement in performance. For SNR = 5 dB this improvement in performance is small: 2.1x10^-3 compared to 6.5x10^-3. However, with soft-decision decoding, the coded system achieves a BER of 7x10^-...

Performance Measures: Coding Gain
The other performance measure is the Coding Gain. Coding gain is defined as the reduction in SNR required to achieve a specific error probability (BER or WER) for a coded communication system compared to an uncoded system.

Example 3.1 (cont'd): Determine the coding gain for BER = 10^-5. For a BER of 10^-5, the Golay-coded system with hard-decision decoding has a coding gain of 2.15 dB over the uncoded system, whereas with soft-decision decoding a coding gain of more than 4 dB is achieved. This result shows that soft-decision decoding of the Golay code achieves 1.85 dB additional coding gain compared to hard-decision decoding at a BER of 10^-5. This additional coding gain is achieved at the expense of higher decoding complexity. Coding gain is important in communication applications, where every dB of improved performance results in savings in overall system cost.

Remarks: At sufficiently low SNR, the coding gain actually becomes negative. This threshold phenomenon is common to all coding schemes: there always exists an SNR below which the code loses its effectiveness and actually makes the situation worse. This SNR is called the Coding Threshold. It is important to keep this threshold low and to operate a coded communication system at an SNR well above its coding threshold. Another quantity that is sometimes used as a performance measure is the Asymptotic Coding Gain (the coding gain for large SNR).
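The coding-gain numbers above are read off BER curves like Figure 2. A quick way to reproduce the uncoded reference curve is sketched below in Python (an illustration, not from the notes): it evaluates the uncoded coherent BPSK bit-error probability P_b = Q(sqrt(2 E_b/N_0)) and bisects for the SNR that reaches a target BER, so that coding gain = SNR_uncoded - SNR_coded at that BER. The coded SNR used at the end is an assumed value chosen to be consistent with the 2.15 dB hard-decision gain quoted above.

import math

def q_func(x):
    # Gaussian tail probability Q(x)
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def uncoded_bpsk_ber(ebn0_db):
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    return q_func(math.sqrt(2.0 * ebn0))

def snr_for_ber(target, lo=0.0, hi=15.0, tol=1e-4):
    # Bisection on the monotonically decreasing BER-vs-SNR curve
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if uncoded_bpsk_ber(mid) > target:
            lo = mid
        else:
            hi = mid
    return hi

snr_uncoded = snr_for_ber(1e-5)          # about 9.6 dB (the notes quote 9.65 dB)
snr_coded_hard = 7.5                     # assumed value read from a Golay BER curve
print(snr_uncoded, snr_uncoded - snr_coded_hard)   # coding gain in dB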

3.3 Shannon's Limit of Code Rate

Shannon's Limit
In designing a coding system for error control, it is desirable to minimize the SNR required to achieve a specific error rate. This is equivalent to maximizing the coding gain of the coded system compared to an uncoded system using the same modulation format. A theoretical limit on the minimum SNR required for a coded system with code rate R to achieve error-free communication (or an arbitrarily small error probability) can be derived from Shannon's noisy coding theorem. This theoretical limit, often called the Shannon Limit, simply says that for a coded system with code rate R, error-free communication is achievable only if the SNR exceeds this limit. As long as the SNR exceeds this limit, Shannon's theorem guarantees the existence of a (perhaps very complex) coded system capable of achieving error-free communication.

For transmission over a binary-input, continuous-output AWGN channel with BPSK signaling, the Shannon limit in terms of SNR, as a function of the code rate, does not have a closed form; however, it can be evaluated numerically.

(Figure 3, from Lin and Costello, Error Control Coding: Shannon limit in dB as a function of the code rate for BPSK signaling on a continuous-output AWGN channel.)

(Figure 4, from Lin and Costello, Error Control Coding: bit-error performance of a rate R = 1/2 convolutional code compared with uncoded BPSK and with the Shannon limit.)

From Fig. 3 (Shannon limit as a function of the code rate for BPSK signaling on a continuous-output AWGN channel), one can see that the minimum required SNR to achieve error-free communication with a coded system of rate R = 1/2 is 0.188 dB.

The Shannon limit can be used as a yardstick to measure the maximum achievable coding gain for a coded system with a given rate R over an uncoded system with the same modulation format. For example, to achieve BER = 10^-5, an uncoded BPSK system requires an SNR of 9.65 dB. For a coded system with code rate R = 1/2, the Shannon limit is 0.188 dB. Therefore, the maximum potential coding gain for a coded system with code rate R = 1/2 is about 9.46 dB. For example (Fig. 4), a rate R = 1/2 convolutional code with memory order 6 achieves BER = 10^-5 at SNR = 4.15 dB, giving a coding gain of 5.35 dB compared to the uncoded system. However, it is still about 3.96 dB away from the Shannon limit. This gap can be reduced by using a more powerful code.
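Since the binary-input AWGN Shannon limit has no closed form, it must be found numerically, as the text notes. The Python sketch below (an illustration, not part of the notes) estimates the BPSK-input AWGN capacity by Monte Carlo integration and then bisects on E_b/N_0 until the capacity just reaches the code rate R; for R = 1/2 it lands near 0.2 dB, consistent with the 0.188 dB value quoted above. The sample size, seed and search range are arbitrary choices.

import numpy as np

def bpsk_awgn_capacity(es_n0, n_samples=200_000, seed=0):
    # Capacity (bits/channel use) of BPSK over AWGN, estimated by Monte Carlo:
    # C = 1 - E[ log2(1 + exp(-2(1+n)/sigma^2)) ], conditioning on X = +1.
    rng = np.random.default_rng(seed)
    sigma2 = 1.0 / (2.0 * es_n0)                  # noise variance for unit-energy symbols
    n = rng.normal(0.0, np.sqrt(sigma2), n_samples)
    z = -2.0 * (1.0 + n) / sigma2
    return 1.0 - np.mean(np.logaddexp(0.0, z)) / np.log(2.0)

def shannon_limit_db(rate, lo=-2.0, hi=12.0, tol=1e-2):
    # Smallest Eb/N0 (in dB) at which the BPSK-AWGN capacity reaches the code rate
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        es_n0 = rate * 10.0 ** (mid / 10.0)       # Es/N0 = R * Eb/N0
        if bpsk_awgn_capacity(es_n0) >= rate:
            hi = mid
        else:
            lo = mid
    return hi

print(shannon_limit_db(0.5))    # roughly 0.2 dB for rate R = 1/2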

3.4 Codes for Error Control

Basic Concepts in Error Control
There can be a hybrid of the two approaches, as well.

(Diagram: classes of codes used for error control (FEC).)

Types of Channels
Random-Error Channels, Burst-Error Channels, Compound Channels.

Random-Error Channels are memoryless channels; the noise affects each transmitted symbol independently. Examples: deep-space and satellite channels, most line-of-sight transmission.

Burst-Error Channels are channels with memory. Examples: fading channels (the channel is in a bad state when a deep fade occurs, which is caused by multipath transmission) and magnetic recordings subject to dropouts caused by surface defects and dust particles.

Compound Channels: both types of errors are encountered.

Ch 4 Error Detection and Correction

(Encoding and decoding procedure: at the transmitter, Source Encoder -> ECC Encoder -> DTC Encoder -> Channel; at the receiver, DTC Decoder -> ECC Decoder -> Source Decoder.)

4.1 Error Detection and Correction Capacity

Definition
A code can be characterized in terms of its error detection capability and its error correction capability. The Error Detection Capability is the ability of the decoder to tell whether an error has been made in transmission. The Error Correction Capability is the ability of the decoder to tell which bits are in error.

Binary Code, M = {0, 1}
The channel encoder maps each message m = (m_0, ..., m_{k-1}), a block of k bits, to a coded sequence (code word) c = (c_0, ..., c_{n-1}), a block of n bits, c in C. The mapping G is the encoding rule and is a one-to-one correspondence. Only 2^k out of the 2^n possible n-bit words are used as code words.

Assumptions: independent bits; each message is equally probable, i.e., 2^k equally likely messages of k bits each; r = n - k redundant bits. Thus, the entropy rate of the coded word is R = k/n bits per code bit; this is also called the Code Rate.

For every c_i, c_j in C, i != j, d_H(c_i, c_j) is the Hamming distance between the two code words. The Hamming Distance is defined as the number of bits in which the two code words differ. There is at least one pair of code words for which the distance is smallest; this smallest distance is called the Minimum Hamming Distance of the code, d_min.

Example 4.1 (Repetition Code): Given the encoding rule G(0) = 000, G(1) = 111, i.e., only two valid code words, find the code rate and the Hamming weight of the code words. The Hamming Weight w_H of a code word is defined as the number of 1 bits in the code word (the Hamming distance between the code word and the zero code word).
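A few lines of Python make these definitions concrete (a small illustrative sketch, not part of the notes): the Hamming distance counts disagreeing positions, the Hamming weight is the distance to the all-zero word, and for the repetition code of Example 4.1 the code rate is R = k/n = 1/3.

def hamming_distance(x, y):
    # Number of positions in which two equal-length words differ
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

def hamming_weight(x):
    # Number of non-zero positions (distance to the all-zero word)
    return hamming_distance(x, [0] * len(x))

c0, c1 = [0, 0, 0], [1, 1, 1]         # G(0) and G(1) of Example 4.1
print(hamming_distance(c0, c1))       # 3 = minimum distance of the code
print(hamming_weight(c1))             # 3
print(1 / 3)                          # code rate R = k/n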

Example 4.1 (cont'd): For the received words in the first column of the table below, determine their source words. Decision: based on the minimum Hamming distance between the received word and the code words.

(Table: Received Word, Decoded Word, Error Flag.)

The code corrects 1 error (d_H = 1), but it cannot simultaneously detect a 2-bit error; moreover, a 2-bit error pattern is miscorrected. Used for detection only, the code detects up to two bits in error (3 bits in error lead to the other code word; d_min between the two code words is 3).

Example 4.2 (Repetition Code): Given the encoding rule G(0) = 0000, G(1) = 1111, find the decoded words for the received words in the table on the next page. Here n = 4, k = 1, r = 3, d_min = 4, R = 1/4. The code corrects 1 error (d_H = 1) and simultaneously detects 2 errors (d_H = 2). An error of 3 or 4 bits will be miscorrected.

(Table: Received Word, Decoded Word, for Example 4.2.)

Hamming Distance and Code Capability
1. Detect up to t errors IF AND ONLY IF d_min >= t + 1.
Example: repetition code, n = 3, k = 1, r = 2, d_min = 3. This code detects up to t = 2 errors.
2. Correct up to t errors IF AND ONLY IF d_min >= 2t + 1.
Example: repetition code, n = 3, k = 1, r = 2, d_min = 3. This code corrects t = 1 error.
3. Detect up to t_d errors and correct up to t_c errors (t_d >= t_c) IF AND ONLY IF d_min >= t_c + t_d + 1.
Example: repetition code, n = 3, k = 1, r = 2, d_min = 3. This code cannot simultaneously correct t_c = 1 and detect t_d = 2 errors.
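The three rules above are easy to check exhaustively for a small code. The Python sketch below (illustrative, not from the notes) computes the minimum Hamming distance of a code by pairwise comparison and derives the guaranteed detection and correction capabilities t_d = d_min - 1 and t_c = floor((d_min - 1)/2).

from itertools import combinations

def hamming_distance(x, y):
    return sum(a != b for a, b in zip(x, y))

def minimum_distance(code):
    # Minimum Hamming distance over all pairs of distinct code words
    return min(hamming_distance(c1, c2) for c1, c2 in combinations(code, 2))

repetition3 = [[0, 0, 0], [1, 1, 1]]            # Example 4.1
repetition4 = [[0, 0, 0, 0], [1, 1, 1, 1]]      # Example 4.2

for code in (repetition3, repetition4):
    d_min = minimum_distance(code)
    t_detect = d_min - 1                 # rule 1: d_min >= t + 1
    t_correct = (d_min - 1) // 2         # rule 2: d_min >= 2t + 1
    print(d_min, t_detect, t_correct)    # (3, 2, 1) and (4, 3, 1)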

Number of Redundant Bits
The minimum Hamming distance is related to the number of redundant bits r by d_min <= r + 1. This gives the lower limit on the number of redundant bits required for a certain minimum Hamming distance (i.e., for a certain detection and correction capability), and it is called the Singleton Bound. For example, for the repetition code with n = 3, k = 1, r = 2, d_min = 3, we have d_min = r + 1 (see its error detection and correction capabilities as previously discussed).

4.2 Linear Block Codes

Definition
Linear block codes can be treated mathematically using the machinery of vector spaces. Linear block codes may be binary (we deal here only with such codes) or non-binary (e.g., Reed-Solomon codes).

The Galois field GF(2) has two elements, i.e., A = {0, 1} or A = GF(2), with the structure (A, +, .), where + is the exclusive-OR operation and . is the AND operation (digital logic).

(A^n, +, .): the vector space A^n is the set of elements a = (a_0, ..., a_{n-1}), with each a_i in A, equipped with vector addition and scalar multiplication. The set of code words C is a subset of A^n; it is a subspace (with 2^k elements), and any subspace is itself a vector space. If the sum of any two code words is also a code word, the code is called a Linear Code. Consequence: the all-zero vector is a code word, 0 in C (because c_1 + c_1 = 0).

Linear Independence: the code words c_0, ..., c_{k-1} are linearly independent if and only if a_0 c_0 + a_1 c_1 + ... + a_{k-1} c_{k-1} = 0 implies a_0 = ... = a_{k-1} = 0; such vectors are Basis Vectors. If they are linearly independent, then every c in C can be uniquely written as c = a_0 c_0 + a_1 c_1 + ... + a_{k-1} c_{k-1}. The Dimension of a vector space is defined as the number of basis vectors it takes to describe (span) it.

Generating Code Words
Question: how do we generate a code word?

c = mG, where m = (m_0, ..., m_{k-1}) is the 1 x k message, c = (c_0, ..., c_{n-1}) is the 1 x n code word, and G is the k x n Generator Matrix with rows g_0, ..., g_{k-1}. The code word is a linear combination of the rows of G; the rows form a basis, so the k rows must be linearly independent. All the rows of G are themselves code words!

Example 4.3: For the linear block code with n = 7, k = 4, r = 3, generated by

(c_0 c_1 c_2 c_3 c_4 c_5 c_6) = (m_0 m_1 m_2 m_3) G, with
G = [1 1 0 1 0 0 0]
    [0 1 1 0 1 0 0]
    [1 1 1 0 0 1 0]
    [1 0 1 0 0 0 1],

find all the code words.
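All 2^k = 16 code words of Example 4.3 can be generated by running every 4-bit message through c = mG over GF(2). A minimal Python sketch (an illustration, not part of the notes):

from itertools import product
import numpy as np

G = np.array([[1, 1, 0, 1, 0, 0, 0],
              [0, 1, 1, 0, 1, 0, 0],
              [1, 1, 1, 0, 0, 1, 0],
              [1, 0, 1, 0, 0, 0, 1]])     # k x n generator matrix (k = 4, n = 7)

codewords = {}
for m in product([0, 1], repeat=4):       # all 2^k messages
    c = np.mod(np.array(m) @ G, 2)        # c = mG over GF(2)
    codewords[m] = c
    print(m, c)

# Sanity check of linearity: the sum of any two code words is again a code word
as_tuples = {tuple(c) for c in codewords.values()}
assert all(tuple(np.mod(a + b, 2)) in as_tuples
           for a in codewords.values() for b in codewords.values())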

Here, the code is a linear systematic block code, since its generator matrix has the form G = [P I_k] defined in the next subsection.

4.2.1 Linear Systematic Block Codes

Definition
If the generator matrix can be written as G = [P I_k], where P is a k x (n-k) matrix (the redundant checking part, producing the n-k parity digits) and I_k is the k x k identity matrix (the message/information part, producing the k message digits), then the linear block code generated by such a generator matrix is called a Linear Systematic Block Code. Its code words have the form c = (b_0, ..., b_{n-k-1}, m_0, ..., m_{k-1}), i.e., the parity bits b = mP followed by the message bits.

Example 4.3 (cont'd): n = 7, k = 4, r = 3, with (c_0 c_1 c_2 c_3 c_4 c_5 c_6) = (m_0 m_1 m_2 m_3) G. Design the encoder.

Parity check bits (the first r bits):
c_0 = m_0 + m_2 + m_3
c_1 = m_0 + m_1 + m_2
c_2 = m_1 + m_2 + m_3
Information bits (the last k bits):
c_3 = m_0, c_4 = m_1, c_5 = m_2, c_6 = m_3.

(Encoding circuit.)

The encoder can thus be designed as a set of exclusive-OR gates implementing the parity-check equations above.

Hamming Weight and Distance
The Hamming distance of two code words c_1, c_2 is d_H(c_1, c_2), the number of positions in which they differ. The Hamming weight of a code word c_i is w_H(c_i), the number of non-zero positions in it. It is clear that d_H(c_1, c_2) = w_H(c_1 + c_2).

In Example 4.3 (n = 7, k = 4, r = 3), determine the Hamming weights of the two code words c_1 = (...) and c_2 = (...).

Minimum Hamming Distance
The minimum Hamming distance of a linear block code is equal to the minimum Hamming weight of its non-zero code vectors. In Example 4.3 (n = 7, k = 4, r = 3): d_min = w_min = 3.

Error Detection and Correction Capacity: Rules
i) Detect up to t errors IF AND ONLY IF d_min >= t + 1.
ii) Correct up to t errors IF AND ONLY IF d_min >= 2t + 1.
iii) Detect up to t_d errors and correct up to t_c errors IF AND ONLY IF d_min >= 2t_c + 1 and d_min >= t_c + t_d + 1.

In Example 4.3 (n = 7, k = 4, r = 3): the minimum Hamming distance is 3, and thus the number of errors which can be detected is 2 and the number of errors which can be corrected is 1. The code does not have the capability to simultaneously detect and correct errors (see the relations between d_min and the correction/detection capability of a code).

Error Vector
For a received vector v = c + e, e is the Error Vector. No error: e = (0 0 0 0 0 0 0). An error at the first bit: e = (1 0 0 0 0 0 0).

Parity Check Matrix
GH^T = 0, where G is the k x n generator matrix, H is the (n-k) x n parity check matrix, and the product is the k x (n-k) zero matrix. For a systematic code, in which G = [P I_k], the parity check matrix is H = [I_{n-k} P^T]. For any code word c: cH^T = mGH^T = 0.

In Example 4.3 (n = 7, k = 4, r = 3), find the parity check matrix. From the generator matrix G in Example 4.3,

H = [1 0 0 1 0 1 1]
    [0 1 0 1 1 1 0]
    [0 0 1 0 1 1 1],

and cH^T = (c_0 c_1 c_2 m_0 m_1 m_2 m_3) H^T = 0 gives the parity check equations:
c_0 + m_0 + m_2 + m_3 = 0
c_1 + m_0 + m_1 + m_2 = 0
c_2 + m_1 + m_2 + m_3 = 0.
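The relation GH^T = 0 and the parity-check equations above can be verified mechanically. A short Python sketch (illustrative, not from the notes) builds H = [I_{n-k} P^T] from the P part of the systematic G and checks both GH^T = 0 and cH^T = 0 for every code word.

import numpy as np
from itertools import product

G = np.array([[1, 1, 0, 1, 0, 0, 0],
              [0, 1, 1, 0, 1, 0, 0],
              [1, 1, 1, 0, 0, 1, 0],
              [1, 0, 1, 0, 0, 0, 1]])           # G = [P | I_k], k = 4, n = 7

k, n = G.shape
P = G[:, : n - k]                               # k x (n-k) parity part
H = np.hstack([np.eye(n - k, dtype=int), P.T])  # H = [I_{n-k} | P^T], (n-k) x n

print(np.mod(G @ H.T, 2))                       # the all-zero k x (n-k) matrix

# Every code word c = mG satisfies c H^T = 0
for m in product([0, 1], repeat=k):
    c = np.mod(np.array(m) @ G, 2)
    assert not np.any(np.mod(c @ H.T, 2))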

Syndrome Calculation and Error Detection
The Syndrome is defined as s = vH^T, a 1 x (n-k) vector (v is 1 x n, H^T is n x (n-k)). s = 0 if v = c is a code word, and s != 0 if v is not a code word.

In Example 4.3 (n = 7, k = 4, r = 3): if the error vector is itself a non-zero code word, then s = eH^T = 0: there is an error, but this error is undetectable! Such an error vector introduces at least 3 errors; since the minimum Hamming distance of this code is 3, a 3-error pattern can turn one code word into another code word.

Note: when we say that the number of errors which can be detected is 2, we refer to all error patterns with 2 bits in error. The code is capable of detecting many patterns with more than 2 errors, but not all of them.

Question: What is the number of error patterns which can be detected with this code?
Answer: The total number of error patterns is 2^n - 1 (the all-zero vector is not an error). However, 2^k - 1 of them lead to code words, which means that they are not detectable. So the number of detectable error patterns is 2^n - 2^k.

Error Correction Capacity: Likelihood Test
Why and when is minimum Hamming distance a good decoding rule?

Let c_1, c_2 be two code words and v be the received word. If c_1 is the actual code word, then the number of errors is d_H(v, c_1). If c_2 is the actual code word, then the number of errors is d_H(v, c_2). Which of these two code words is most likely, given v? The most likely code word is the one with the greatest probability of occurring together with the received word, i.e., choose c_1 if p(v, c_1) > p(v, c_2) and c_2 otherwise. This is called the Likelihood Ratio Test, or, equivalently (taking logarithms), the Log-Likelihood Ratio Test: decide c_1 if ln p(v, c_1) - ln p(v, c_2) > 0, and c_2 otherwise.

The joint probabilities can be further written as p(v, c_i) = p(v | c_i) p(c_i), i = 1, 2. For the BSC channel (independent errors),

p(v | c_i) = p^{t_i} (1 - p)^{n - t_i}, i = 1, 2,

where t_i = d_H(v, c_i) is the number of errors that have occurred during the transmission of code word c_i. Since the received word corresponds to one specific error pattern, the binomial coefficient does not appear above.

IF Condition 1: the code words have the same a priori probability, and Condition 2: p < 0.5 (p is the crossover probability of the BSC channel), then by performing some calculations one gets that the most likely code word is the one at minimum Hamming distance from v: maximum-likelihood decoding reduces to minimum-distance decoding.
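The omitted calculation is short; the following is a reconstruction (in LaTeX notation) of the step the notes refer to, under Conditions 1 and 2 above.

\begin{align*}
\ln p(\mathbf{v},\mathbf{c}_i) &= \ln p(\mathbf{c}_i) + t_i \ln p + (n - t_i)\ln(1-p) \\
                               &= \ln p(\mathbf{c}_i) + n\ln(1-p) + t_i \ln\frac{p}{1-p}.
\end{align*}

Since the code words are equiprobable (Condition 1), the terms ln p(c_i) cancel in the log-likelihood ratio, and since ln(p/(1-p)) < 0 for p < 0.5 (Condition 2), ln p(v, c_1) - ln p(v, c_2) > 0 if and only if t_1 < t_2: the maximum-likelihood decision picks the code word closest to v in Hamming distance.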

4.2.4 Decoding Linear Block Codes

Standard Array Decoder
The simplest, least clever, and often most expensive strategy for implementing error correction is to simply look up c in a decoding table that contains all possible v. This is called a standard-array decoder, and the lookup table is called the Standard Array.

The first word in the first column of the standard array is the zero code word (it also corresponds to zero error). If there is no error, the received words are the code words; these are given in the first row of the standard array. For a linear block code (n, k), the first row contains the 2^k code words, including the zero code word. All 2^n words are contained in the array. Each row contains 2^k words, so the number of columns is 2^k. The number of rows is then 2^n / 2^k = 2^{n-k} = 2^r. The standard array for a (7, 4) code can be seen in the table on the next page.

When decoding with the standard array, we identify the column of the array in which the received vector appears. The decoded vector is the vector in the first row of that column. Each row is called a Coset. In the first column we have all the correctable error patterns; these are called Coset Leaders. Decoding is done correctly if and only if the error pattern caused by the channel is a coset leader (including the zero vector). The words in each column, except for the first element (which is a code word), are obtained by adding the coset leaders to that code word.

Question: how do we choose the coset leaders? To minimize the probability of a decoding error, the error patterns that are most likely to occur for a given channel should be chosen as coset leaders. For a BSC, an error pattern of smaller weight is more probable than an error pattern of larger weight. Therefore, when the standard array is formed, each coset leader should be chosen to be a vector of least weight among the remaining available vectors. With coset leaders chosen this way, each coset leader has the minimum weight in its coset, and each column contains the words that are at minimum distance from the code word at the head of that column. A linear block code is capable of correcting 2^{n-k} error patterns (including the zero error pattern).

(Standard array for the (7, 4) code of Example 4.3: 2^r = 8 cosets of 2^k = 16 words each.)

Syndrome Decoder
The standard-array decoder becomes slow when the block code length is large. A more efficient method is the syndrome decoder.

The Syndrome Vector is defined as s = vH^T, a 1 x (n-k) vector, where v is the 1 x n received vector and H is the parity-check matrix; s = 0 if v = c is a code word and s != 0 otherwise. The syndrome is independent of the transmitted code word; it depends only on the error vector (for a specific code). All the 2^k n-tuples (n-bit words) of a coset have the same syndrome.

Steps of the Syndrome Decoder
1. For the received word, the syndrome is calculated: s = vH^T = eH^T.
2. The coset leader e corresponding to this syndrome is found.
3. The transmitted code word is estimated as c = v + e.

Example 4.4: Design the syndrome decoder for Example 4.3, in which n = 7, k = 4, r = 3. For the parity-check matrix of Example 4.3, each correctable error pattern is paired with its syndrome: the syndrome of a single-bit error in position i is simply the i-th column of H, so the eight syndromes (including the all-zero syndrome for the all-zero error pattern) are distinct, and the decoding table has 2^{n-k} = 8 entries.
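A direct implementation of the three decoding steps for the (7, 4) code of Example 4.3 is sketched below in Python (illustrative, not from the notes). It tabulates the syndrome of every correctable (here: single-bit) error pattern and then decodes a received word by syndrome lookup.

import numpy as np

H = np.array([[1, 0, 0, 1, 0, 1, 1],
              [0, 1, 0, 1, 1, 1, 0],
              [0, 0, 1, 0, 1, 1, 1]])            # (n-k) x n parity-check matrix

n = H.shape[1]

def syndrome(v):
    return tuple(np.mod(v @ H.T, 2))

# Step 2 preparation: syndrome -> coset leader, for the zero and single-bit errors
table = {syndrome(np.zeros(n, dtype=int)): np.zeros(n, dtype=int)}
for i in range(n):
    e = np.zeros(n, dtype=int)
    e[i] = 1
    table[syndrome(e)] = e                       # each column of H is distinct

def decode(v):
    s = syndrome(v)                              # step 1: s = vH^T
    e = table[s]                                 # step 2: coset leader
    return np.mod(v + e, 2)                      # step 3: c = v + e

c = np.array([0, 1, 1, 1, 0, 0, 1])              # code word for message (1001)
v = c.copy()
v[2] ^= 1                                        # single-bit error in position 2
print(decode(v))                                 # recovers c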

4.2.5 Hamming Codes

Definition
Hamming codes are important linear block codes, used for single-error control in digital communication and data storage systems. For any integer r >= 3 there exists a Hamming code with the following parameters:
Code length: n = 2^r - 1
Number of information digits: k = 2^r - 1 - r
Number of parity check digits: n - k = r
Error correction capability: t = 1 (d_min = 3).
A systematic Hamming code has parity-check matrix H = [I_r Q], where the columns of Q are the r-tuples of weight 2 or more.

In Example 4.3 (n = 7, k = 4, r = 3):
Code length: n = 2^3 - 1 = 7
Number of information digits: k = 2^3 - 1 - 3 = 4
Number of parity check digits: n - k = 3
Error correction capability: t = 1 (d_min = 3).
Thus, the code given as example is a Hamming code.

Example 4.5: Construct the parity-check matrix of the (7, 4) systematic Hamming code.
Example 4.6: Write down the generator matrix of the Hamming code of Example 4.5.

Perfect Code
If we form the standard array for the Hamming code of length n = 2^r - 1, the n-tuples of weight 1 can be used as the coset leaders. Recall that the number of cosets is 2^n / 2^k = 2^{n-k} = 2^r = n + 1: the coset leaders are exactly the zero vector and the n n-tuples of weight 1. Such a code is called a Perfect Code. PERFECT does not mean BEST! A Hamming code corrects only single-error patterns and no others.

Some Theorems on the Relation Between the Parity Check Matrix and the Weight of Code Words
Theorem 1: For each code word of weight d, there exist d columns of H such that the vector sum of these columns is equal to the zero vector. The converse is also true.
Theorem 2: The minimum weight (distance) of a code is equal to the smallest number of columns of H that sum to 0.

In Example 4.3 (n = 7, k = 4, r = 3): the columns of H are non-zero and distinct, so no two columns add to zero, and the minimum distance of the code is at least 3. Since H contains all non-zero r-tuples as its columns, the vector sum of any two columns must itself be a column of H, and thus there are three columns whose sum is zero. Hence the minimum Hamming distance is exactly 3.
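The defining property used above, that H contains every non-zero r-tuple exactly once as a column, gives a direct construction. The Python sketch below (illustrative, not from the notes) builds such an H for a given r and confirms, via Theorem 2, that the minimum distance is 3: no one or two columns sum to zero, but some three columns do.

import numpy as np
from itertools import product, combinations

def hamming_parity_check(r):
    # Columns are all non-zero r-tuples: a (2^r - 1, 2^r - 1 - r) Hamming code
    cols = [np.array(c) for c in product([0, 1], repeat=r) if any(c)]
    return np.array(cols).T                      # r x (2^r - 1)

H = hamming_parity_check(3)
print(H.shape)                                   # (3, 7)

# Theorem 2: d_min = smallest number of columns of H summing to zero (mod 2)
def smallest_dependent_set(H):
    n = H.shape[1]
    for d in range(1, n + 1):
        for idx in combinations(range(n), d):
            if not np.any(np.mod(H[:, list(idx)].sum(axis=1), 2)):
                return d
    return None

print(smallest_dependent_set(H))                 # 3, so d_min = 3 and t = 1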

Shortened Hamming Codes
If we delete l columns from the parity-check matrix H of a Hamming code, the dimension of the new parity-check matrix H' becomes r x (2^r - 1 - l). Using H' we obtain a Shortened Hamming Code with the following parameters:
Code length: n = 2^r - 1 - l
Number of information digits: k = 2^r - 1 - r - l
Number of parity check digits: n - k = r
Minimum Hamming distance: at least 3.

In Example 4.3, we shorten the (7, 4) code. We delete from P^T all the columns of even weight; then no three columns of the new H add to zero, since the total weight of any three remaining (odd-weight) columns is odd. However, for the remaining column of weight 3 there are 3 columns in I_r such that the sum of those 4 columns is zero. We can thus conclude that the minimum Hamming distance of the shortened code is exactly 4. By shortening the code, the error correction and detection capability is therefore increased: the shortened code is capable of correcting all single-error patterns and simultaneously detecting all double-error patterns.

Ch 5 Cyclic Codes

5.1 Description of Cyclic Codes

Definition
Cyclic codes are a class of linear block codes which can be implemented with extremely cost-effective electronic circuits.

Cyclic Shift Property
A cyclic shift of c = (c_0, c_1, ..., c_{n-2}, c_{n-1}) is given by c^(1) = (c_{n-1}, c_0, c_1, ..., c_{n-2}); in general, a cyclic shift of c by i places can be written as c^(i) = (c_{n-i}, ..., c_{n-1}, c_0, ..., c_{n-i-1}). A Cyclic Code is a linear block code C with code words c = (c_0, c_1, ..., c_{n-2}, c_{n-1}) such that for every c in C, the vector given by a cyclic shift of c is also a code word.

Example 5.1: Verify that the (6, 2) repetition code C = {(000000), (111111), (010101), (101010)} is a cyclic code. A cyclic shift of any of its code vectors results in a vector that is an element of C; check this by yourself.

Example 5.2: Verify that the (5, 2) linear block code defined by the generator matrix G, with code vectors c = mG, is not a cyclic code.

The cyclic shift of the code word (10111) is (11011), which is not an element of C. Similarly, the cyclic shift of (01101) is (10110), which is also not a code word.

Code (or Codeword) Polynomial
There is a one-to-one correspondence between the code word c = (c_0, c_1, ..., c_{n-2}, c_{n-1}) and the code polynomial c(X) = c_0 + c_1 X + ... + c_{n-2} X^{n-2} + c_{n-1} X^{n-1}, of degree (highest exponent of X) n - 1 or less.

Theorem: The non-zero code polynomial of minimum degree in a cyclic code is unique, and its degree is r = n - k.

Theorem 1: A binary polynomial of degree n - 1 or less is a code polynomial if and only if it is a multiple of g(X):

c(X) = m(X) g(X) = (m_0 + m_1 X + ... + m_{k-1} X^{k-1}) g(X),

where c(X) has degree n - 1 or less, m(X) has degree k - 1 or less, g(X) has degree r, and m_0, ..., m_{k-1} are the k information digits to be encoded. An (n, k) cyclic code is completely specified by its non-zero code polynomial of minimum degree, g(X), called the generator polynomial.

Theorem 2: The generator polynomial g(X) of an (n, k) cyclic code is a factor of X^n + 1.

Question: For any n and k, is there an (n, k) cyclic code?
Theorem 3: If g(X) is a polynomial of degree r = n - k and it is a factor of X^n + 1, then g(X) generates an (n, k) cyclic code.
Remark: For large n, X^n + 1 may have many factors of degree n - k. Some of these polynomials generate good codes, whereas others generate bad codes.

Example 5.3: Determine the factors of X^7 + 1 that can generate (7, 4) cyclic codes. For a (7, 4) code, r = n - k = 7 - 4 = 3. Since X^7 + 1 = (1 + X)(1 + X + X^3)(1 + X^2 + X^3), the generator polynomial can be chosen either as g(X) = 1 + X + X^3 or as g(X) = 1 + X^2 + X^3.

Systematic Cyclic Code
For the message m(X) = m_0 + m_1 X + ... + m_{k-1} X^{k-1}, the steps to generate the systematic cyclic code word are:
Step 1: multiply m(X) by X^{n-k}.
Step 2: divide X^{n-k} m(X) by g(X) to obtain the remainder b(X): X^{n-k} m(X) = a(X) g(X) + b(X), where X^{n-k} m(X) has degree n - 1 or less, a(X) has degree k - 1 or less, and b(X) has degree n - k - 1 or less.
Step 3: form the code word c(X) = b(X) + X^{n-k} m(X) = a(X) g(X), i.e.,

c(X) = b_0 + b_1 X + ... + b_{n-k-1} X^{n-k-1} + m_0 X^{n-k} + ... + m_{k-1} X^{n-1},

whose first n - k coefficients are the parity check bits and whose last k coefficients are the message.

Example 5.4: Find the (7, 4) systematic cyclic code word generated by g(X) = 1 + X + X^3 when m(X) = 1 + X^3, i.e., m = (1001).
Step 1: Multiply the message m(X) by X^{n-k}: X^3 m(X) = X^3 + X^6.
Step 2: Obtain the remainder b(X) from dividing X^{n-k} m(X) by g(X): b(X) = X + X^2.
Step 3: Combine b(X) and X^{n-k} m(X) to form the systematic code word: c(X) = X + X^2 + X^3 + X^6, i.e.,

c = (0 1 1 1 0 0 1),

where the first three bits are the parity check bits and the last four bits (1001) are the message.
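The three encoding steps of Example 5.4 amount to one polynomial division over GF(2), which is how cyclic encoders are built in hardware (shift registers) and is easy to mimic in software. A Python sketch (illustrative, not from the notes), with polynomials stored as coefficient lists, lowest degree first:

def gf2_poly_mod(dividend, divisor):
    # Remainder of GF(2) polynomial division; coefficient lists, lowest degree first
    rem = dividend[:]
    for i in range(len(rem) - 1, len(divisor) - 2, -1):   # reduce from the highest degree down
        if rem[i]:
            shift = i - (len(divisor) - 1)
            for j, g in enumerate(divisor):
                rem[shift + j] ^= g
    return rem[: len(divisor) - 1]

def cyclic_encode_systematic(m, g, n):
    # Systematic (n, k) cyclic encoding: c(X) = b(X) + X^{n-k} m(X), b(X) = X^{n-k} m(X) mod g(X)
    k = len(m)
    shifted = [0] * (n - k) + list(m)          # step 1: X^{n-k} m(X)
    b = gf2_poly_mod(shifted, g)               # step 2: remainder b(X)
    return b + list(m)                         # step 3: parity bits first, message last

g = [1, 1, 0, 1]                               # g(X) = 1 + X + X^3
print(cyclic_encode_systematic([1, 0, 0, 1], g, 7))   # -> [0, 1, 1, 1, 0, 0, 1], as in Example 5.4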

5.2 Generator and Parity-check Matrices

Generator Matrix
Let C be an (n, k) cyclic code with generator polynomial g(X) = g_0 + g_1 X + ... + g_{n-k} X^{n-k}. Then a code polynomial can be written as c(X) = m(X) g(X) = m_0 g(X) + m_1 X g(X) + ... + m_{k-1} X^{k-1} g(X), which is equivalent to the fact that g(X), X g(X), ..., X^{k-1} g(X) span C. The k x n generator matrix therefore has these k polynomials as its rows:

G = [g_0 g_1 g_2 ... g_{n-k}   0       0   ... 0]
    [0   g_0 g_1 ... g_{n-k-1} g_{n-k} 0   ... 0]
    [...                                         ]
    [0   ... 0   g_0 g_1 ...               g_{n-k}]

with g_0 = g_{n-k} = 1.

Systematic Generator Matrix
In general, G is not in systematic form; however, we can bring it into systematic form by performing row operations. Reminder: for a systematic code, G = [P I_k].

Example 5.5: Determine the systematic generator matrix of the (7, 4) cyclic code generated by g(X) = 1 + X + X^3.

G = [1 1 0 1 0 0 0]   (R1)
    [0 1 1 0 1 0 0]   (R2)
    [0 0 1 1 0 1 0]   (R3)
    [0 0 0 1 1 0 1]   (R4)

Replacing R3 by R3 + R1 and R4 by R4 + R1 + R2 gives the systematic form

G = [1 1 0 1 0 0 0]
    [0 1 1 0 1 0 0]
    [1 1 1 0 0 1 0]
    [1 0 1 0 0 0 1].

For the (7, 4) cyclic code generated by g(X) = 1 + X + X^3, the message (1100) is encoded (using the non-systematic G) as c = (1 0 1 1 1 0 0); the other messages are encoded analogously. For the (7, 4) cyclic code in systematic form, generated by g(X) = 1 + X + X^3, the message (0011) is encoded as c = (0 1 0 0 0 1 1).


More information

ELEC 515 Information Theory. Distortionless Source Coding

ELEC 515 Information Theory. Distortionless Source Coding ELEC 515 Information Theory Distortionless Source Coding 1 Source Coding Output Alphabet Y={y 1,,y J } Source Encoder Lengths 2 Source Coding Two coding requirements The source sequence can be recovered

More information

1. Basics of Information

1. Basics of Information 1. Basics of Information 6.004x Computation Structures Part 1 Digital Circuits Copyright 2015 MIT EECS 6.004 Computation Structures L1: Basics of Information, Slide #1 What is Information? Information,

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A new 4 credit unit course Part of Theoretical Computer Science courses at the Department of Mathematics There will be 4 hours

More information

MATH3302. Coding and Cryptography. Coding Theory

MATH3302. Coding and Cryptography. Coding Theory MATH3302 Coding and Cryptography Coding Theory 2010 Contents 1 Introduction to coding theory 2 1.1 Introduction.......................................... 2 1.2 Basic definitions and assumptions..............................

More information

Entropy and Ergodic Theory Lecture 3: The meaning of entropy in information theory

Entropy and Ergodic Theory Lecture 3: The meaning of entropy in information theory Entropy and Ergodic Theory Lecture 3: The meaning of entropy in information theory 1 The intuitive meaning of entropy Modern information theory was born in Shannon s 1948 paper A Mathematical Theory of

More information

CMPT 365 Multimedia Systems. Final Review - 1

CMPT 365 Multimedia Systems. Final Review - 1 CMPT 365 Multimedia Systems Final Review - 1 Spring 2017 CMPT365 Multimedia Systems 1 Outline Entropy Lossless Compression Shannon-Fano Coding Huffman Coding LZW Coding Arithmetic Coding Lossy Compression

More information

Image Compression. Fundamentals: Coding redundancy. The gray level histogram of an image can reveal a great deal of information about the image

Image Compression. Fundamentals: Coding redundancy. The gray level histogram of an image can reveal a great deal of information about the image Fundamentals: Coding redundancy The gray level histogram of an image can reveal a great deal of information about the image That probability (frequency) of occurrence of gray level r k is p(r k ), p n

More information

Physical Layer and Coding

Physical Layer and Coding Physical Layer and Coding Muriel Médard Professor EECS Overview A variety of physical media: copper, free space, optical fiber Unified way of addressing signals at the input and the output of these media:

More information

Basic Principles of Video Coding

Basic Principles of Video Coding Basic Principles of Video Coding Introduction Categories of Video Coding Schemes Information Theory Overview of Video Coding Techniques Predictive coding Transform coding Quantization Entropy coding Motion

More information

Chapter 7. Error Control Coding. 7.1 Historical background. Mikael Olofsson 2005

Chapter 7. Error Control Coding. 7.1 Historical background. Mikael Olofsson 2005 Chapter 7 Error Control Coding Mikael Olofsson 2005 We have seen in Chapters 4 through 6 how digital modulation can be used to control error probabilities. This gives us a digital channel that in each

More information

Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Module No. # 01 Lecture No. # 08 Shannon s Theory (Contd.)

More information

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 41 Pulse Code Modulation (PCM) So, if you remember we have been talking

More information

Information Theory with Applications, Math6397 Lecture Notes from September 30, 2014 taken by Ilknur Telkes

Information Theory with Applications, Math6397 Lecture Notes from September 30, 2014 taken by Ilknur Telkes Information Theory with Applications, Math6397 Lecture Notes from September 3, 24 taken by Ilknur Telkes Last Time Kraft inequality (sep.or) prefix code Shannon Fano code Bound for average code-word length

More information

Appendix B Information theory from first principles

Appendix B Information theory from first principles Appendix B Information theory from first principles This appendix discusses the information theory behind the capacity expressions used in the book. Section 8.3.4 is the only part of the book that supposes

More information

6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011

6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011 6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011 On the Structure of Real-Time Encoding and Decoding Functions in a Multiterminal Communication System Ashutosh Nayyar, Student

More information

Module 1. Introduction to Digital Communications and Information Theory. Version 2 ECE IIT, Kharagpur

Module 1. Introduction to Digital Communications and Information Theory. Version 2 ECE IIT, Kharagpur Module ntroduction to Digital Communications and nformation Theory Lesson 3 nformation Theoretic Approach to Digital Communications After reading this lesson, you will learn about Scope of nformation Theory

More information

5 Mutual Information and Channel Capacity

5 Mutual Information and Channel Capacity 5 Mutual Information and Channel Capacity In Section 2, we have seen the use of a quantity called entropy to measure the amount of randomness in a random variable. In this section, we introduce several

More information

Exercises with solutions (Set B)

Exercises with solutions (Set B) Exercises with solutions (Set B) 3. A fair coin is tossed an infinite number of times. Let Y n be a random variable, with n Z, that describes the outcome of the n-th coin toss. If the outcome of the n-th

More information

National University of Singapore Department of Electrical & Computer Engineering. Examination for

National University of Singapore Department of Electrical & Computer Engineering. Examination for National University of Singapore Department of Electrical & Computer Engineering Examination for EE5139R Information Theory for Communication Systems (Semester I, 2014/15) November/December 2014 Time Allowed:

More information

Data Compression. Limit of Information Compression. October, Examples of codes 1

Data Compression. Limit of Information Compression. October, Examples of codes 1 Data Compression Limit of Information Compression Radu Trîmbiţaş October, 202 Outline Contents Eamples of codes 2 Kraft Inequality 4 2. Kraft Inequality............................ 4 2.2 Kraft inequality

More information