Entropies & Information Theory

Size: px

Start display at page:

Download "Entropies & Information Theory"

Merilyn Glenn
6 years ago
Views:

1 Entropies & Information Theory LECTURE I Nilanjana Datta University of Cambridge,U.K. See lecture notes on:

2 Quantum Information Theory Born out of Classical Information Theory Mathematical theory of storage, transmission & processing of information Quantum Information Theory: how these tasks can be accomplished using quantum-mechanical systems as information carriers (e.g. photons, electrons, ) Quantum Physics Quantum Information Theory Information Theory

3 The underlying quantum mechanics distinctively new features These can be exploited to: improve the performance of certain information-processing tasks as well as accomplish tasks which are impossible in the classical realm!

4 Classical Information Theory: 1948, Claude Shannon He posed 2 questions: (Q1) What is the limit to which information can be reliably compressed? relevance: physical limit on storage (Q2) What is the maximum amount of information that can be transmitted reliably per use of a communications channel? relevance: noise in communications channels information = data =signals= messages = outputs of a source

5 He posed 2 questions: Classical Information Theory:1948, Claude Shannon (Q1) What is the limit to which information can be reliably compressed? (A1) Shannon s Source Coding Theorem: data compression limit = Shannon entropy of the source (Q2) What is the maximum amount of information that can be transmitted reliably per use of a communications channel? (A2) Shannon s Noisy Channel Coding Theorem: maximum rate of info transmission: given in terms of the mutual information

6 Shannon: information What is information? uncertainty Information gain = decrease in uncertainty of an event measure of information measure of uncertainty Surprisal or Self-information: Consider an event described by a random variable (r.v.) X p( x) (p.m.f); x J (finite alphabet) A measure of uncertainty in getting outcome : ( x): log p( x) a highly improbable outcome is surprising! x log log 2 px ( ) only depends on -- not on values taken by continuous; additive for independent events x X

7 Shannon entropy = average surprisal Defn: Shannon entropy H ( X ) of a discrete r.v. X p( x), H ( X) ( ( X)) px ( )log px ( ) xj Convention: 0log0 0 lim wlog w 0 w0 log log 2 (If an event has zero probability, it does not contribute to the entropy) H ( X ) : a measure of uncertainty of the r.v. also quantifies the amount of info we gain on average when we learn the value of X X x J H ( X) H( p ) H p( x) X p ( ) X p x x J

8 Example: Binary Entropy X p( x) J {0,1} p(0) p; p(1) 1 p; H ( X) plog p(1 p)log(1 p) h( p) Properties p p 0 x 1 h( p) 0 hp ( ) 1 x0 no uncertainty p 0.5 : h( p) 1 Concave function of Continuous function of maximum uncertainty p p

9 Operational Significance of the Shannon Entropy = optimal rate of data compression for a classical memoryless (i.i.d.) information source Classical Information Source Outputs/signals : sequences of letters from a finite set J : source alphabet (i) binary alphabet J {0,1} (ii) telegraph English : 26 letters + a space (iii) written English : 26 letters in upper & lower case + punctuation J

10 Simplest example: a memoryless source successive signals: independent of each other characterized by a probability distribution On each use of the source, a letter u J pu Modelled by a sequence of i.i.d. random variables U p( u) ( ) u J emitted with prob U, U,..., U u J 1 2 n i pu ( ) PU ( u), uj1 kn. k p( u) Signal emitted by uses of the source: n u ( u, u,..., u ) u 1 2 n ( n) p( u) P( U u, U u,..., U u ) n n 1 2 p( u ) p( u )... p( u n ) Shannon entropy of the source: HU ( ): pu ( )log pu ( ) uj

11 (Q) Why is data compression possible? (A) There is redundancy in the info emitted by the source -- an info source typically produces some outputs more frequently than others: In English text e occurs more frequently than z. --during data compression one exploits this redundancy in the data to form the most compressed version possible Variable length coding: -- more frequently occurring signals (e.g e ) assigned shorter descriptions (fewer bits) than the less frequent ones (e.g. z ) Fixed length coding: -- identify a set of signals which have high prob of occurrence: typical signals -- assign unique fixed length (l) binary strings to each of them -- all other signal (atypical) assigned a single binary string of same length (l)

12 Typical Sequences Defn: Consider an i.i.d. info source : 0, U, U,... U ; p( u) ; u J 1 2 For any sequences 1 2 n n u: ( u, u,... u ) J for which n( H( U) ) n( H( U) ) 2 pu ( 1, u2,... u n ) 2, where HU ( ) are called typical sequences n Shannon entropy of the source ( n T ) : typical set = set of typical sequences Note: Typical sequences are almost equiprobable u ( n T ), pu ( ) 2 nh U ( )

13 Operational Significance of the Shannon Entropy (Q) What is the optimal rate of data compression for such a source? [ min. # of bits needed to store the signals emitted per use of the source] (for reliable data compression) Optimal rate is evaluated in the asymptotic limit n n number of uses of the source One requires p ( n) error 0 ; n (A) optimal rate of data compression = HU ( ) Shannon entropy of the source

14 Compression-Decompression Scheme U, U,... U ; U p( u) ; u J Suppose is an i.i.d. information 1 2 n i Shannon entropy source A compression scheme of rate When is this a compression scheme? Decompression: Average probability of error: H ( U ) R : En : u: ( u1, u2,... u n ) 1 2 n J mn D n Compr.-decompr. scheme reliable if m : 0,1 n ( n) p av x : ( x, x,... x mn ) ( n) av nr J n m 0,1 n pup ( ) Dn( En( u)) u u p 0 as n

15 Shannon s Source Coding Theorem: U, U,... U ; U p( u) ; u J Suppose is an i.i.d. information 1 2 n i Shannon entropy source H ( U ) R Suppose : then there exists a reliable compression scheme of rate H( U) R for the source. R H( U) If then any compression scheme of rate will not be reliable. R

16 Entropies for a pair of random variables Consider a pair of discrete random variables X p( x) ; x JX and Y p( y) ; y JY Given their joint probabilities & their conditional probabilities P( X x, Y y) p( x, y) ; P( Y y X x) p( y x) ; Joint entropy: H( X, Y): p( x, y) log p( x, y) xj X yj Y Conditional entropy: HY ( X): pxhy ( ) ( X x) xj X xj X px ( ) py ( x) log py ( x) yjy xj X yj Y HY ( X) 0 p( xy, ) log py ( x)

17 Entropies for a pair of random variables Relative Entropy: Measure of the distance between two probability distributions p p( x) ; q q( x) xj p( x) D( p q): p( x)log xj qx ( ) xj convention: 0 u 0log 0 ; ulog u 0 u 0 D( p q) 0 D( p q) 0 if & only if p q BUT not a true distance not symmetric; does not satisfy the triangle inequality

18 Entropies for a pair of random variables Mutual Information: Measure of the amount of info one r.v. contains about another r.v. X p( x), Y p( y) pxy (, ) I( X, Y): p( x, y)log xy, p( x) p( y) I( X : Y) D( p p p ) XY X Y p pxy (, ) ; p px ( ) ; p py ( ) XY xy, X x Y y Chain rules: H( X, Y) H( Y X) H( X) I( X : Y) H( X) H( Y) H( X, Y) H( X) H( X Y) HY ( ) HY ( X)

19 Properties of Entropies Let X p( x), Y p( y) be discrete random variables: Then, H( X) 0, with equality if & only if is deterministic H ( X) log J, if x J Subadditivity: H( X, Y) H( X) H( Y), p X Concavity: if & are 2 prob. distributions, H ( p (1 ) p ) H( p ) (1 ) H( p ), X Y X Y HY ( X) 0, or equivalently H( X, Y) H( Y), p Y X I( X : Y) 0 with equality if & only if & X are independent Y

20 So far. Classical Data Compression: answer to Shannon s question (Q1) What is the limit to which information can be reliably compressed? (A1) Shannon s Source Coding Theorem: data compression limit = Shannon entropy of the source 1 st Classical entropies and their properties

21 Shannon s 2 nd question (Q2) What is the maximum amount of information that can be transmitted reliably per use of a communications channel? The biggest hurdle in the path of efficient transmission of info is the presence of noise in the communications channel Noise distorts the information sent through the channel. input noisy channel output output input To combat the effects of noise: use error-correcting codes

22 To overcome the effects of noise: instead of transmitting the original messages, -- the sender encodes her messages into suitable codewords -- these codewords are then sent through (multiple uses of) the channel Alice Bob Alice s message encoding En codeword input n N ( n) uses of N output decoding Dn Bob s inference Error-correcting code: C : ( E, D ): n n n

23 The idea behind the encoding: To introduce redundancy in the message so that upon decoding, Bob can retrieve the original message with a low probability of error: The amount of redundancy which needs to be added depends on the noise in the channel

24 Memoryless binary symmetric channel (m.b.s.c.) 0 1 p p 1-p 1-p Encoding: it transmits single bits effect of the noise: to flip the bit with probability p codewords the 3 bits are sent through 3 successive uses of the m.b.s.c. Suppose m.b.s.c. codeword Example Decoding : (majority voting) Repetition Code (Bob receives) (Bob infers)

25 Probability of error for the m.b.s.c. : without encoding = p with encoding = Prob (2 or more bits flipped) := q inference possible inputs output of 3 uses of a m.b.s.c. 010 Prove: q < p if p < 1/2 -- in this case encoding helps! (Encoding Decoding) : Repetition Code.

26 Information transmission is said to be reliable if: -- the probability of error in decoding the output vanishes asymptotically in the number of uses of the channel Aim: to achieve reliable information transmission whilst optimizing the rate the amount of information that can be sent per use of the channel The optimal rate of reliable info transmission: capacity

27 Discrete classical channel N J X input alphabet; JY output alphabet n uses of N input ( n) x ( n ) n x J X N ( n) output ( ) ( ) ( n n py x ) ( n) ( n) y y J Y n py ( n) ( n) ( x ) conditional probabilities ; known to sender & receiver

28 Correspondence between input & output sequences is not 1-1 x ( n ) n J X y ( n ) n J Y ( ) x ' n Shannon proved: it is possible to choose a subset of input sequences-- such that there exists only : 1 highly likely input corresponding to a given input Use these input sequences as codewords

( n) uses of x ( x1, x2,..., xn ); ( n) y ( y, y,.

29 Transmission of info through a classical channel Alice s Alice M : m M Alice s message codeword: output: finite set of messages encoding En ( n) ( n) x input N noisy channel n N ( n) uses of x ( x1, x2,..., xn ); ( n) y ( y, y,..., y ); 1 2 n N ( n) y output N ( n ) : decoding Dn p y Bob m M Bob s inference ( n) ( n) ( x ) Error-correcting code: C : ( E, D ): n n n

30 m M Alice s message If m m encoding En ( n) x input ( n) N then an error occurs! ( n) y output decoding Info transmission is reliable if: Prob. of error 0 Dn as m M Bob s inference n Rate of info transmission = number of bits of message transmitted per use of the channel Aim: achieve reliable transmission whilst maximizing the rate Shannon: there is a fundamental limit on the rate of reliable info transmission ; property of the channel Capacity: maximum rate of reliable information transmission

31 Shannon in his Noisy Channel Coding Theorem: -- obtained an explicit expression for the capacity of a memoryless classical channel n ( n) ( n) ( ) ( i i) i1 p y x p y x Memoryless (classical or quantum) channels action of each use of the channel is identical and it is independent for different uses -- i.e., the noise affecting states transmitted through the channel on successive uses is assumed to be uncorrelated.

32 Classical memoryless channel: a schematic representation X px ( ) N input x output y x J, p( y x) X y J Y, Y channel: a set of conditional probs. p( y x) Capacity input distributions C( N ) max I( X : Y) px ( ) I( X : Y) H( X) H( Y) H( X, Y) Shannon Entropy mutual information H( X) px ( )log px ( ) x

33 Shannon s Noisy Channel Coding Theorem: For a memoryless channel: X px ( ) input N p( y x) Y output Optimal rate of reliable info transmission capacity C( N ) max I( X : Y) px ( ) Sketch of proof

34 Supposedly von Neumann asked Shannon to call this quantity entropy, saying You should call it entropy for two reasons; first, the function is already in use in thermodynamics; second, and most importantly, most people don t know what entropy really is, & if you use entropy in an argument, you will win every time.

35 Relation between Shannon entropy & thermodynamic entropy S : klog th Suppose the microstate occurs with prob.. consider replicas of the system th Then on average replicas are in the state Total # of microstates Th.dyn.entropy of compound system of replicas v r vr [ vpr] v v! v v v v v v v v1 v2 vr!!..! r 1 2 S kv p log p v r r r r (by Stirling s) S S / v k p log p k H p r Shannon v r r r v total # of microstates r p r entropy

Entropies & Information Theory

Entropies & Information Theory Etropies & Iformatio Theory LECTURE I Nilajaa Datta Uiversity of Cambridge,U.K. For more details: see lecture otes (Lecture 1- Lecture 5) o http://www.qi.damtp.cam.ac.uk/ode/223 Quatum Iformatio Theory