Classical Capacities of Quantum Channels


Classical Capacities of Quantum Channels

Jens Christian Bo Jørgensen
Supervisor: Jan Philip Solovej

Thesis for the Master's degree in Mathematics
Institute for Mathematical Sciences, University of Copenhagen, Denmark

February 4, 2008

Preface

The work in this master's thesis was carried out during the period February 2007 to February 2008. The initial idea to study the classical capacities of quantum channels was inspired by a talk given by Nilanjana Datta during my participation in the spring school on Theoretical and Technological Perspectives on Quantum Information and Communication in Marseilles. The project later evolved into a review of many results relating to classical capacities of quantum channels, including the Holevo Conjecture. This thesis is intended for an audience of other master's students in mathematics or physics. The prerequisites for reading this paper consist of a basic knowledge of quantum mechanics and bachelor-level mathematics. I would like to take the opportunity to thank Jan Philip Solovej for careful and patient supervision and numerous interesting discussions. I would also like to thank Mary Beth Ruskai and Chris King for taking the time to discuss quantum information theory with me during my visit to Boston in the fall of 2007. For meticulous proofreading and critical comments I would like to thank Jonatan Brask Bohr. Thank you also to my father for correcting my English.

Contents

1 Introduction
  1.1 The Holevo Additivity Conjecture
2 Representation of Quantum Channels
3 Capacity
  3.1 Notation
  3.2 Capacity of a classical channel
  3.3 The Classical Capacity of a Quantum Channel
4 Capacity Relations
  4.1 The HSW Theorem
  4.2 Shannon Capacity
5 Convexity Theory
  5.1 General Convexity Results
  5.2 The Set of States
  5.3 The Holevo Capacity - Reconsidered
6 Equivalence of Additivity Conjectures
7 Towards a Proof or a Counterexample
  7.1 Winter's proof
8 Conclusion and Outlook
A Properties of the Von Neumann Entropy
B Channel Extension Computation
C Equality of Distributions
D Explicit form of gaussian rate function
Bibliography

1 Introduction

A central issue in information theory as well as in quantum information theory is to understand ultimate rates of communication. Suppose two parties, Alice and Bob, are connected by some sort of communication channel, and we ask the question: How much information can Alice transmit to Bob per use of the channel? The answer of course depends on the nature of the channel, and on what is meant by "information". In classical information theory a simple model of communication, the so-called classical noisy channel, has been widely studied. In a noisy channel each input is subject to random noise, so that Alice only knows with some probability what message Bob will receive. Noisy channels are all around us. Any type of digital data transmission is subject to noise due to losses in wires, poor weather conditions and so on. These sources of noise can effectively be modeled as being random. For noisy channels the question above has been answered satisfactorily. Shannon's famous Noisy Channel Coding Theorem provides a formula for the capacity of a noisy channel. The capacity is the ultimate rate of asymptotically perfect transmission of information by $n$ independent uses of the channel, as $n \to \infty$. For a concrete channel, Shannon's formula allows for a relatively easy calculation of the capacity; at least numerically.

Now suppose Alice and Bob are connected by a quantum channel. A quantum channel is a very general model of a device capable of changing and transmitting quantum states. It is a generalization of the classical noisy channel. For a concrete example, think of an optical cable connecting Alice and Bob, through which individual photons are sent. Alice can prepare the individual states of the photons, and Bob can make measurements on the received photons in accordance with the laws of quantum mechanics. On their way through the channel, the states of the photons change due to interactions with the channel and due to noise. How much classical information (Footnote 1) can Alice send to Bob per use of the channel, that is, per photon? In other words, what is the classical capacity of the quantum channel? Though this question was first posed in the early days of quantum information theory, nearly forty years ago, it still remains open.

In the quantum world things are not so simple. There is not one, but (at least) four capacities to consider for a quantum channel. Answering the question above amounts to showing how these four capacities relate to each other. Much progress was made around the turn of the millennium, when it was discovered that two of these four capacities are in fact identical and a (relatively) simple formula was provided for a third capacity, called the Holevo capacity. However, it is still not fully understood how all the capacities relate to each other. The main open question left is the additivity of the Holevo capacity. We will see precisely what this means in Section 1.1. If the Holevo capacity is additive, the four capacities collapse into two, both of which are given by (relatively) simple formulas. If not, the capacity question remains open.

Footnote 1: There exist purely quantum mechanical measures of information in the literature. We shall not consider them here. Everywhere in this paper "information" will mean classical information, as introduced by Claude Shannon.

In this paper we will review some of the major results concerning the classical capacity of a quantum channel and provide a peek into some of the latest research on the issue. The paper consists of two parts. In the first part, which comprises Chapters 1-4, we study quantum channels and define the four different capacities in terms of information transmission rates. These capacity measures are then related to each other, and we provide a formula for each of them. Some questions are necessarily left open as they depend on the Holevo Conjecture. The second part of this paper comprises Chapters 5-7 and focuses on the Holevo Conjecture. We begin with an outline of useful tools from convex analysis. These tools are then applied to the Holevo capacity to provide five equivalent formulations of the Holevo Conjecture. Finally, we close the paper with a presentation of a very recent proof strategy for the Holevo Conjecture, which is based on one of these formulations. The structure of the thesis is displayed in Figure 1.

[Figure 1: The structure of the thesis. Classical Capacity track: 2. Representation of Quantum Channels, 3. Capacity Measures, 4. Capacity Relations. Holevo Conjecture track: 5. Convexity Theory, 6. Equivalence of Additivity Conjectures, 7. Towards a Proof or a Counterexample.]

1.1 The Holevo Additivity Conjecture

In this introductory chapter we will take the shortest path to the Holevo conjecture. We will define the quantities necessary to understand the precise statement of the conjecture, but leave out motivation and physical interpretation for later chapters. The objective is that the reader quickly gets to the mathematical kernel of this paper. This section was inspired by [1].

The Classical Noisy Channel

Let us begin with a definition of a classical noisy channel or stochastic map.

Definition 1.1 (Classical Channel) Let $X$ and $Y$ be finite sets. A stochastic map is a map $\Phi\colon X \times Y \to [0,1]$ satisfying
$$\sum_{y \in Y} \Phi(x,y) = 1, \qquad x \in X. \qquad (1)$$
In this setting we call the sets alphabets, and the elements they contain letters.

[Figure 2: Given the input letter $x$ from Alice, the channel transmits the letter $y$ to Bob with probability $P(y|x)$.]

We should think of $\Phi$ as the mathematical model of a real physical channel. The idea is that Alice is placed at one end of the channel equipped with the alphabet $X$ and Bob at the other end equipped with the alphabet $Y$. Now Alice wants to send messages to Bob, but the channel does not (necessarily) transmit letters faithfully. When Alice transmits the letter $x$, Bob receives the letter $y$ with probability $\Phi(x,y)$. By definition, for any $x \in X$, the map $y \mapsto \Phi(x,y)$ is a probability distribution. Sometimes we shall write the stochastic map as $P(y|x) = \Phi(x,y)$, emphasizing that we think of $P(y|x)$ as the conditional probability of receiving $y$, given that $x$ was sent.

Many noisy channels appearing in real life can be modeled by a stochastic map. Data transmission from one mobile phone to another is an example. On the way from sender to receiver the data is exposed to many sources of noise, such as bad weather conditions, losses in wires, a bird flying into the broadcasting mast, etc. (Footnote 2). The effect of all this noise can be modeled by choosing an appropriate probability distribution $\Phi(x,\cdot)$ for each $x$.

Given a channel $\Phi$ and an input probability distribution $\pi$ on $X$, the channel transforms $\pi$ into an output probability distribution $\pi' = \Phi\pi$ on $Y$, given by
$$\pi'(y) = \sum_{x \in X} \Phi(x,y)\pi(x).$$
Let $\delta_x$, $x \in X$, denote the degenerate probability distribution on $X$, i.e.
$$\delta_x(k) = \begin{cases} 1 & \text{for } k = x \\ 0 & \text{else.} \end{cases} \qquad (2)$$

Footnote 2: The reason we can actually send messages from one mobile phone to another is due to clever error correcting codes. Essentially, the probability of error is reduced by introducing redundancies in the messages that are to be transmitted. As a simplified example: the message "hhhhhhhhhhiiiiiiiiii" is sent instead of "hi".
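For readers who like to experiment, a stochastic map over finite alphabets is conveniently stored as a matrix whose rows are the conditional distributions $\Phi(x,\cdot)$; the action $\pi \mapsto \Phi\pi$ is then a matrix-vector product. The following minimal sketch is my own illustration (the binary symmetric channel and the flip probability 0.1 are arbitrary choices, not examples from the text):

```python
import numpy as np

# A stochastic map Phi: X x Y -> [0,1], stored with Phi[x, y] = P(y | x).
# Illustrative example: a binary symmetric channel with flip probability 0.1.
Phi = np.array([[0.9, 0.1],
                [0.1, 0.9]])
assert np.allclose(Phi.sum(axis=1), 1.0)   # each row Phi(x, .) is a probability distribution, cf. (1)

def apply_channel(Phi, pi):
    """Output distribution (Phi pi)(y) = sum_x Phi(x, y) pi(x)."""
    return Phi.T @ pi

pi = np.array([0.7, 0.3])                  # an input distribution on X
print(apply_channel(Phi, pi))              # the output distribution on Y

delta_0 = np.array([1.0, 0.0])             # the degenerate distribution delta_x of (2), here x = 0
print(apply_channel(Phi, delta_0))         # equals the row Phi(0, .)
```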

A channel $\Phi$ for which for any $x \in X$ we have $\Phi\delta_x = \delta_{y_x}$ for some $y_x \in Y$ is noiseless. Denote by
$$P(X) = \{\pi\colon \pi(x) \geq 0,\ \sum_{x \in X} \pi(x) = 1\}, \qquad (3)$$
the convex set of probability distributions on $X$. In a convex set $K$, a point $x$ is called extreme if it does not lie in the open line segment $]a,b[$ for any points $a,b \in K$. The set of extreme points of $K$ is called the extreme boundary, and is denoted by $\mathrm{ext}\,K$. If any point $x \in K$ can be written uniquely as a convex combination of extreme points, the set $K$ is called a simplex. We will return to convexity theory in Chapter 5. It is straightforward to check that the extreme points of the set $P(X)$ are the degenerate probability distributions and that $P(X)$ is a simplex.

Proposition 1.2 For a direct product $X_1 \times X_2$ of two alphabets, the extreme points of $P(X_1 \times X_2)$ are precisely the products of extreme points of $P(X_1)$ and $P(X_2)$, i.e.,
$$\mathrm{ext}\,P(X_1 \times X_2) = \mathrm{ext}\,P(X_1) \otimes \mathrm{ext}\,P(X_2). \qquad (4)$$

Proof. Follows immediately from $\delta_{(x_1,x_2)} = \delta_{x_1} \otimes \delta_{x_2}$.

When Alice wants to send $n$ letters to Bob she uses the physical channel $n$ times. In the most general situation any use of the physical channel may depend on previous uses of the channel. However, we will assume that the physical channel is "memoryless", so that this is not the case. How do we model multiple uses of a channel? Consider two channels $\Phi_i\colon X_i \times Y_i \to [0,1]$, $i = 1,2$. We define the channel product $\Phi_1 \otimes \Phi_2\colon (X_1 \times X_2) \times (Y_1 \times Y_2) \to [0,1]$ by
$$\Phi_1 \otimes \Phi_2(x_1,x_2,y_1,y_2) = \Phi_1(x_1,y_1)\Phi_2(x_2,y_2). \qquad (5)$$
It is easy to verify that $\Phi_1 \otimes \Phi_2$ is a stochastic map and that $\otimes$ is associative. The product on the right-hand side of (5) is exactly the "memoryless" assumption. Then $n$ uses of the physical channel correspond to applying the channel $\Phi \otimes \cdots \otimes \Phi$ ($n$ factors).

[Figure: $n$ parallel uses of the channel $\Phi$, mapping the input letters $a_1,\ldots,a_n$ to the output letters $b_1,\ldots,b_n$.]
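In the same matrix picture, the channel product (5) is simply a Kronecker product, so $n$ memoryless uses of a channel can be built up by repeated Kronecker products. A small sketch (again my own illustration, continuing the binary symmetric channel example above):

```python
import numpy as np

def channel_product(Phi1, Phi2):
    """Stochastic matrix of Phi1 (x) Phi2 on the product alphabets, cf. (5)."""
    return np.kron(Phi1, Phi2)

Phi = np.array([[0.9, 0.1],
                [0.1, 0.9]])

n = 3
Phi_n = Phi
for _ in range(n - 1):                       # n memoryless uses: Phi (x) ... (x) Phi
    Phi_n = channel_product(Phi_n, Phi)

print(Phi_n.shape)                           # (2**n, 2**n)
print(np.allclose(Phi_n.sum(axis=1), 1.0))   # the product is again a stochastic map
```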

According to Shannon's Noisy Channel Coding Theorem, the capacity of a classical noisy channel is given by
$$C(\Phi) = \sup_{\pi \in P(X)} \Big\{ H(\Phi\pi) - \sum_x \pi(x) H(\Phi\delta_x) \Big\}. \qquad (6)$$
The term in curly brackets is called the Shannon mutual information of the probability distributions $\pi$ and $\Phi\pi$, and
$$H(\pi) = -\sum_x \pi(x) \log \pi(x) \qquad (7)$$
is the Shannon entropy. Here, and in the following, $\log$ will always denote the base 2 logarithm. We have $\lim_{x \to 0} x\log x = 0$, so define $0\log 0 := 0$. Then $x \mapsto x\log x$ is continuous on $[0,\infty)$. In Chapter 3 we will see how to define the capacity in terms of data transmission, but for now we can take (6) either on faith or as a temporary definition.

We can view $P(X)$ as a compact subset of the finite dimensional vector space $\mathbb{R}^X$ and $P(Y)$ as a compact subset of $\mathbb{R}^Y$. The map $\pi \mapsto \Phi\pi$ is continuous, since it can be extended to a linear map $\mathbb{R}^X \to \mathbb{R}^Y$. Furthermore the entropy function $H\colon P(Y) \to [0,\infty)$ is continuous. Thus the mutual information
$$H(\pi : \Phi\pi) = H(\Phi\pi) - \sum_x \pi(x) H(\Phi\delta_x) \qquad (8)$$
is a continuous function on $P(X)$. Hence it follows that
$$C(\Phi) = \max_{\pi \in P(X)} \Big\{ H(\Phi\pi) - \sum_x \pi(x) H(\Phi\delta_x) \Big\}, \qquad (9)$$
i.e. the supremum in the definition of $C$ is in fact a maximum. A probability distribution attaining the max is called optimal. The capacity has the following property.

Proposition 1.3 (Additivity of classical capacity) For two stochastic maps $\Phi_1$ and $\Phi_2$ we have
$$C(\Phi_1 \otimes \Phi_2) = C(\Phi_1) + C(\Phi_2). \qquad (10)$$

In colloquial terms, this proposition says that two independent channels viewed as one channel can transmit as much information as the sum of what the individual channels are capable of transmitting. This is what one would expect.

Proof. "$\geq$" (superadditivity). Observe that $H(\sigma_1 \otimes \sigma_2) = H(\sigma_1) + H(\sigma_2)$ for any two probability distributions. It is also straightforward to see that $\Phi_1 \otimes \Phi_2(\delta_{x_1} \otimes \delta_{x_2}) = \Phi_1\delta_{x_1} \otimes \Phi_2\delta_{x_2}$, for $x_i \in X_i$, $i = 1,2$, simply by unfolding definitions. Now let $\pi_i$ be the optimal distribution for $\Phi_i$ according to (9). Then we get
$$\begin{aligned}
C(\Phi_1 \otimes \Phi_2) &\geq H(\pi_1 \otimes \pi_2 : \Phi_1 \otimes \Phi_2\, \pi_1 \otimes \pi_2) \\
&= H(\Phi_1 \otimes \Phi_2\, \pi_1 \otimes \pi_2) - \sum_{x_1,x_2} \pi_1(x_1)\pi_2(x_2)\, H(\Phi_1 \otimes \Phi_2\, \delta_{(x_1,x_2)}) \\
&= H(\Phi_1\pi_1) + H(\Phi_2\pi_2) - \sum_{x_1,x_2} \pi_1(x_1)\pi_2(x_2)\big[ H(\Phi_1\delta_{x_1}) + H(\Phi_2\delta_{x_2}) \big] \\
&= \Big[ H(\Phi_1\pi_1) - \sum_{x_1} \pi_1(x_1) H(\Phi_1\delta_{x_1}) \Big] + \Big[ H(\Phi_2\pi_2) - \sum_{x_2} \pi_2(x_2) H(\Phi_2\delta_{x_2}) \Big] \\
&= C(\Phi_1) + C(\Phi_2).
\end{aligned}$$

"$\leq$" (subadditivity). Let $\sigma$ be a distribution on $Y_1 \times Y_2$. Let $\sigma_1$ and $\sigma_2$ be the marginal distributions on $Y_1$ and $Y_2$, respectively. That is, $\sigma_1(y_1) = \sum_{y_2} \sigma(y_1,y_2)$ and $\sigma_2(y_2) = \sum_{y_1} \sigma(y_1,y_2)$. The entropy is subadditive, meaning that $H(\sigma) \leq H(\sigma_1) + H(\sigma_2)$. Let $\pi$ be a distribution attaining the maximum in (9) and let $\pi_i$, $i = 1,2$, be its marginal distributions. Then
$$\begin{aligned}
C(\Phi_1 \otimes \Phi_2) &= H(\Phi_1 \otimes \Phi_2\, \pi) - \sum_x \pi(x) H((\Phi_1 \otimes \Phi_2)\delta_x) \\
&\leq H(\Phi_1\pi_1) + H(\Phi_2\pi_2) - \sum_{x_1,x_2} \pi(x_1,x_2)\, H((\Phi_1 \otimes \Phi_2)\,\delta_{x_1} \otimes \delta_{x_2}) \\
&= H(\Phi_1\pi_1) + H(\Phi_2\pi_2) - \sum_{x_1,x_2} \pi(x_1,x_2)\big[ H(\Phi_1\delta_{x_1}) + H(\Phi_2\delta_{x_2}) \big] \\
&= \Big[ H(\Phi_1\pi_1) - \sum_{x_1} \pi_1(x_1) H(\Phi_1\delta_{x_1}) \Big] + \Big[ H(\Phi_2\pi_2) - \sum_{x_2} \pi_2(x_2) H(\Phi_2\delta_{x_2}) \Big] \\
&\leq C(\Phi_1) + C(\Phi_2).
\end{aligned}$$

In particular we have that $C(\Phi^{\otimes n}) = nC(\Phi)$.

The Quantum Channel

According to basic quantum mechanics every physical system comes associated with a Hilbert space $\mathcal{H}$, in such a way that the states of the physical system are described by the set of density matrices
$$D(\mathcal{H}) = \{\varrho \in B(\mathcal{H})\colon \mathrm{tr}(\varrho) = 1,\ \varrho \geq 0\}. \qquad (11)$$
The positivity requirement $\varrho \geq 0$ means that $\langle x|\varrho|x\rangle \geq 0$ for any unit vector $x \in \mathcal{H}$. The space $B(\mathcal{H})$ is itself a Hilbert space when equipped with the Hilbert-Schmidt inner product $\langle a,b\rangle = \mathrm{tr}(ab^*)$, for $a,b \in B(\mathcal{H})$. The space $D(\mathcal{H})$ is easily seen to be convex, and it is not hard to show that the set is bounded in the norm induced by the inner product. Hence $D(\mathcal{H})$ is also compact. Any time evolution of a system in some state $\varrho_A \in D(\mathcal{H}_A)$ to a state $\varrho_B \in D(\mathcal{H}_B)$, where $\dim \mathcal{H}_A, \dim \mathcal{H}_B < \infty$, can be described by a so-called quantum channel. We will consider quantum channels more closely in Chapter 2. A quantum channel is defined as follows.

Definition 1.4 (Quantum Channel) Let $\Phi\colon B(\mathcal{H}_1) \to B(\mathcal{H}_2)$ be a linear map, where $B(\mathcal{H}_i)$ is the space of bounded operators on the finite dimensional Hilbert space $\mathcal{H}_i$, for $i = 1,2$. If $\Phi$ is completely positive and trace preserving (CPT), it is called a quantum channel or a CPT map. That is, the linear map $\Phi$ must satisfy:
1. (trace preserving) $\mathrm{tr}\,\Phi(\varrho) = \mathrm{tr}(\varrho)$ for all $\varrho \in B(\mathcal{H}_1)$.
2. (completely positive) For any auxiliary Hilbert space $V$, the map $I \otimes \Phi\colon B(V \otimes \mathcal{H}_1) \to B(V \otimes \mathcal{H}_2)$, where $I\colon B(V) \to B(V)$ is the identity map, must be positive. That is, $(I \otimes \Phi)(A) \geq 0$ for $A \geq 0$.

Remark 1.5 Note that the domain and range of a quantum channel are vector spaces of the form $B(\mathcal{H})$. This is a matter of convenience rather than a reflection of physical reality.

The states of a system are objects in the much smaller convex set $D(\mathcal{H}) \subset B(\mathcal{H})$. A completely positive map is a map satisfying the last criterion above, but which is not necessarily trace preserving. We shall consider such maps in Chapter 6.

As for the classical channels, we can consider the capacity of a quantum channel. In the quantum world things are not so simple. There are many types of capacities depending on which restrictions we put on the transmission of information, or even depending on what we mean by information. In the rapidly expanding field of quantum information, attempts are being made to introduce a purely quantum mechanical notion of information. We shall not consider this here. Instead we will restrict ourselves to the simpler information transmission protocol in which Alice sends classical information to Bob, but using a quantum channel. That is, first Alice must encode her information in quantum states. The states are then transmitted to Bob using the quantum channel. It is up to Bob to decode the received quantum states in order to retrieve the original message. Capacities for this type of protocol are called classical capacities for quantum channels. Even in this simpler situation, there are (at least) four different capacity measures to consider depending on what restrictions we put on the coding and decoding methods Alice and Bob use. In Chapter 4 we investigate the various relations among these capacities. As part of this, we will prove that two of these seemingly different measures are in fact identical. We will not be able to provide a full description of the capacity relations, as this is a subject of current research.

One of the four (or really three) capacity measures has emerged as particularly important. This is the Holevo capacity $\chi$. It has been conjectured that the Holevo capacity is additive, precisely as was the case for the classical capacity. The conjecture has been standing for nearly 10 years. If additivity is proved to hold, another two of the capacity measures will collapse into one, leaving us with only two capacity measures. Among the two remaining capacities, the Holevo capacity will be the more interesting, since it measures the highest obtainable classical capacity for a quantum channel. The other capacity measure pertains to the restrictive information transmission protocol in which entangled codewords are not allowed, or equivalently in which the quantum channel is used as a classical channel. What is also important is that the Holevo capacity is given by a formula which can be explicitly computed for concrete channels - at least numerically (though not as simply as with Shannon's formula). In this respect, a proof of the additivity conjecture is the missing link in the description of the classical capacity of a quantum channel.

The Holevo Capacity

An ensemble $E = (p_i, \varrho_i)$ in $D(\mathcal{H})$ is a finite collection of states $\varrho_i$ together with a probability distribution $\{p_i\}_i$. The quantum analogue of (6) is the Holevo capacity
$$\chi(\Phi) = \sup_E \Big[ S\big(\sum_i p_i \Phi(\varrho_i)\big) - \sum_j p_j S(\Phi(\varrho_j)) \Big]. \qquad (12)$$
The supremum here is over all ensembles in $D(\mathcal{H})$ and
$$S(\varrho) = -\mathrm{tr}(\varrho \log \varrho) \qquad (13)$$
is the Von Neumann entropy. Any density matrix $\varrho$ has a spectral decomposition $\varrho = \sum_{i=1}^n \lambda_i |x_i\rangle\langle x_i|$, where $\lambda_i$ are the eigenvalues and $\{|x_i\rangle\}_i$ is an orthonormal basis.

By definition then
$$\varrho \log \varrho = \sum_{i=1}^n \lambda_i \log(\lambda_i)\, |x_i\rangle\langle x_i|,$$
where by definition $\lambda_i \log(\lambda_i) = 0$ if $\lambda_i = 0$. Hence $S(\varrho) = H(\{\lambda_i\})$. Consider the Holevo quantity defined by
$$K(E) = S\big(\sum_i p_i \varrho_i\big) - \sum_j p_j S(\varrho_j). \qquad (14)$$
For a channel $\Phi\colon B(\mathcal{H}) \to B(\mathcal{K})$ we let $\Phi E$ denote the ensemble $(p_i, \Phi(\varrho_i))$ in $D(\mathcal{K})$. The Holevo capacity of the channel $\Phi$ can then be written as
$$\chi(\Phi) = \sup_E K(\Phi E), \qquad (15)$$
where again the supremum is over ensembles in $D(\mathcal{H})$.

A state $\varrho$ in $D(\mathcal{H})$ is said to be pure if $\varrho = |x\rangle\langle x|$ for some unit vector $x \in \mathcal{H}$. The set of pure states forms the extreme boundary of the set $D(\mathcal{H})$. This we show in Chapter 5. In that chapter we also show that the Holevo capacity is given by
$$\chi(\Phi) = \max_{\substack{\{p_j,\varrho_j\}_{j=1}^k \\ \varrho_j \text{ pure}}} \Big[ S\big(\sum_i p_i \Phi(\varrho_i)\big) - \sum_j p_j S(\Phi(\varrho_j)) \Big]. \qquad (16)$$
That is, the supremum has been replaced with a maximum over ensembles containing $k$ pure states. Here $k$ is a fixed number depending only on the dimension of $\mathcal{H}_A$. It can be shown that $k = (\dim \mathcal{H}_A)^2$, but we shall not prove nor use this result.

For two quantum mechanical systems $A_1$ and $A_2$ with possible states $D(\mathcal{H}_{A_1})$ and $D(\mathcal{H}_{A_2})$, the axioms of quantum mechanics dictate that the possible states of the combined system $A_1 A_2$ are $D(\mathcal{H}_{A_1} \otimes \mathcal{H}_{A_2})$. For two quantum channels $\Phi_i\colon B(\mathcal{H}_{A_i}) \to B(\mathcal{H}_{B_i})$, with $i = 1,2$, consider the channel $\Phi_1 \otimes \Phi_2\colon B(\mathcal{H}_{A_1} \otimes \mathcal{H}_{A_2}) \to B(\mathcal{H}_{B_1} \otimes \mathcal{H}_{B_2})$, uniquely defined by its action on product states,
$$\Phi_1 \otimes \Phi_2(\varrho_1 \otimes \varrho_2) = \Phi_1(\varrho_1) \otimes \Phi_2(\varrho_2), \qquad (17)$$
for all $\varrho_i \in B(\mathcal{H}_{A_i})$, with $i = 1,2$. As for the classical channel, this definition assumes that the channels $\Phi_1$ and $\Phi_2$ are independent. The additivity conjecture is the following.

Conjecture 1.6 For all quantum channels $\Phi_1$ and $\Phi_2$ the Holevo capacity is additive, i.e.,
$$\chi(\Phi_1 \otimes \Phi_2) \overset{?}{=} \chi(\Phi_1) + \chi(\Phi_2). \qquad (18)$$

Now let us mimic the proof of Proposition 1.3 with the classical capacity replaced by the Holevo capacity, and see where it breaks down.

Incomplete proof of additivity. "$\geq$" (superadditivity). Let $E_1 = \{p_i, \varrho_i\}_i$ and $E_2 = \{q_j, \tau_j\}_j$ be ensembles attaining the maximum in (16), and let $E_1 \otimes E_2$ denote the ensemble $\{p_i q_j, \varrho_i \otimes \tau_j\}_{i,j}$.

Then we have
$$\begin{aligned}
\chi(\Phi_1 \otimes \Phi_2) &\geq S\big(\sum_{i,j} p_i q_j\, \Phi_1 \otimes \Phi_2(\varrho_i \otimes \tau_j)\big) - \sum_{i,j} p_i q_j\, S(\Phi_1 \otimes \Phi_2(\varrho_i \otimes \tau_j)) \\
&= S\big(\sum_i p_i \Phi_1(\varrho_i)\big) + S\big(\sum_j q_j \Phi_2(\tau_j)\big) - \sum_i p_i S(\Phi_1(\varrho_i)) - \sum_j q_j S(\Phi_2(\tau_j)) \\
&= \chi(\Phi_1) + \chi(\Phi_2).
\end{aligned}$$
Here we have used the property of the Von Neumann entropy that $S(\varrho \otimes \tau) = S(\varrho) + S(\tau)$. This proves that $\chi$ is superadditive.

"$\leq$" (subadditivity). The Von Neumann entropy is subadditive, meaning that for $\varrho_{12} \in D(\mathcal{H}_1 \otimes \mathcal{H}_2)$ we have $S(\varrho_{12}) \leq S(\varrho_1) + S(\varrho_2)$, where $\varrho_1 = \mathrm{tr}_2\, \varrho_{12}$ and $\varrho_2 = \mathrm{tr}_1\, \varrho_{12}$, and $\mathrm{tr}_i$ indicates a trace over $\mathcal{H}_i$ for $i = 1,2$. Let $E = \{p_i, \varrho_i\}_i$ be an optimal ensemble in $D(\mathcal{H}_{A_1} \otimes \mathcal{H}_{A_2})$ for $\chi(\Phi_1 \otimes \Phi_2)$. Using the subadditivity property we get
$$\begin{aligned}
\chi(\Phi_1 \otimes \Phi_2) &= S\big(\sum_i p_i\, \Phi_1 \otimes \Phi_2(\varrho_i)\big) - \sum_i p_i S(\Phi_1 \otimes \Phi_2(\varrho_i)) \\
&\leq \Big[ S\big(\sum_i p_i \Phi_1(\varrho_{i,1})\big) + S\big(\sum_i p_i \Phi_2(\varrho_{i,2})\big) \Big] - \sum_i p_i S(\Phi_1 \otimes \Phi_2(\varrho_i)) \ \leq \ ?
\end{aligned}$$
Here $\varrho_{i,1}$ and $\varrho_{i,2}$ denote the reduced states of $\varrho_i$ on $\mathcal{H}_{A_1}$ and $\mathcal{H}_{A_2}$, respectively. The question mark is where the proof breaks down. Looking back at the proof for the additivity of classical capacity, we used at this particular step that
$$\mathrm{ext}\,P(X_1 \times X_2) = \mathrm{ext}\,P(X_1) \otimes \mathrm{ext}\,P(X_2), \qquad (19)$$
to write $\delta_x = \delta_{x_1} \otimes \delta_{x_2}$. In the quantum case we have
$$\mathrm{ext}(D(\mathcal{H}_1 \otimes \mathcal{H}_2)) \neq \mathrm{ext}\,D(\mathcal{H}_1) \otimes \mathrm{ext}\,D(\mathcal{H}_2). \qquad (20)$$
(See Chapter 5.) In other words, there exist states $\varrho \in D(\mathcal{H}_1 \otimes \mathcal{H}_2)$ which cannot be written as $\varrho = \varrho_1 \otimes \varrho_2$ for any choice of $\varrho_i \in D(\mathcal{H}_i)$, $i = 1,2$. Such states are called entangled. The existence of entangled states is a simple consequence of the axioms of quantum mechanics, but the implications of this existence are vast and not well understood. In fact, the whole field of quantum information theory essentially deals with understanding entanglement as a resource for fast communication. Our "failed" proof above is an indication that more sophisticated methods need to be taken into account when attacking the additivity conjecture. In Chapter 7 we will consider some of the latest developments on the conjecture, including a possible proof strategy.
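The Von Neumann entropy (13) and the Holevo quantity (14) are straightforward to evaluate numerically for a given ensemble. The sketch below is my own illustration; the two-state qubit ensemble is an arbitrary choice, and computing $\chi(\Phi)$ itself would additionally require an optimization over ensembles as in (16).

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -tr(rho log2 rho), computed from the eigenvalues, cf. (13)."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]          # 0 log 0 := 0
    return float(-np.sum(evals * np.log2(evals)))

def holevo_quantity(probs, states):
    """K(E) = S(sum_i p_i rho_i) - sum_j p_j S(rho_j) for an ensemble E = (p_i, rho_i), cf. (14)."""
    avg = sum(p * rho for p, rho in zip(probs, states))
    return von_neumann_entropy(avg) - sum(p * von_neumann_entropy(rho)
                                          for p, rho in zip(probs, states))

# An arbitrary qubit ensemble of two pure states (illustrative choice).
ket0 = np.array([[1.0], [0.0]])
ket_plus = np.array([[1.0], [1.0]]) / np.sqrt(2)
rho0 = ket0 @ ket0.conj().T
rho_plus = ket_plus @ ket_plus.conj().T

print(holevo_quantity([0.5, 0.5], [rho0, rho_plus]))  # about 0.6 bits, strictly less than 1
```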

2 Representation of Quantum Channels

The quantum channel is the mathematical model describing the evolution of a quantum system with finitely many degrees of freedom. Suppose that the initial state of our system $A$ is $\varrho_A$ and that this state is not coupled with the environment. This means that the joint state of the system and the environment has the form $\varrho_A \otimes \varrho_E$, where $\varrho_E$ is the state of the environment. According to the laws of quantum mechanics, in the Schrödinger picture, the system will undergo unitary evolution as time progresses. Thus, at a later time the joint state will be $U(\varrho_A \otimes \varrho_E)U^*$, for some unitary matrix $U$. The matrix $U$ depends on the nature of the physical system involved. The end state of system $A$ is thus $\mathrm{tr}_E(U(\varrho_A \otimes \varrho_E)U^*)$, where we have traced out the environment. Now in general we may be interested in the end state of some subsystem $B$ of the joint system, and not just the system $A$. Therefore, the most general type of maps we will be concerned with are of the form $\Phi\colon B(\mathcal{H}_A) \to B(\mathcal{H}_B)$, with
$$\Phi(\varrho) = \mathrm{tr}_F(U(\varrho \otimes \varrho_E)U^*), \qquad (21)$$
where $\varrho_E \in B(\mathcal{H}_E)$ is a density matrix, $U \in B(\mathcal{H}_A \otimes \mathcal{H}_E)$ is a unitary operator, and $\mathcal{H}_A$, $\mathcal{H}_B$, $\mathcal{H}_E$, $\mathcal{H}_F$ are finite dimensional Hilbert spaces such that $\mathcal{H}_A \otimes \mathcal{H}_E = \mathcal{H}_B \otimes \mathcal{H}_F$. Figure 3 displays the various Hilbert spaces involved in this definition. We will temporarily call a map of the form (21) a Stinespring map, after W. F. Stinespring, who considered maps of this sort in the 1950s.

[Figure 3: The environment, represented by the systems E and F, participates in the unitary time evolution U of the total system. The quantum channel is obtained by tracing out the environment.]

It turns out that a Stinespring map and a quantum channel are in fact the same object. Furthermore, a quantum channel has a nice operator-sum representation due to K. Kraus and M. D. Choi. We collect these results in the following theorem.

Theorem 2.1 (Representation of Quantum Channels) The following are equivalent for a map $\Phi\colon B(\mathcal{H}_A) \to B(\mathcal{H}_B)$:
(a) $\Phi$ is a quantum channel.
(b) $\Phi$ is a Stinespring map.

(c) $\Phi(\tau) = \sum_i E_i \tau E_i^*$ (finite sum) for all $\tau \in B(\mathcal{H}_A)$ and for some operators $E_i\colon \mathcal{H}_A \to \mathcal{H}_B$ such that $\sum_i E_i^* E_i = I$.

Remark 2.2 Note that the order of $E_i$ and $E_i^*$ is interchanged in the two sums in (c). The operator elements are far from unique. It can be shown that two lists of elements $\{E_1,\ldots,E_n\}$ and $\{F_1,\ldots,F_m\}$ give rise to the same quantum operation when $E_i = \sum_{j=1}^m v_{ij} F_j$ for $i = 1,\ldots,n$, where $v$ is an $n \times m$ matrix such that $v^* v = I_m$ and $v v^* = I_n$.

Proof. (a) $\Rightarrow$ (c). Suppose $\Phi$ is a quantum channel. We will find an operator-sum representation of $\Phi$. Let $\mathcal{H}_R$ be a Hilbert space with the same dimension as $\mathcal{H}_A$ and let $\{|i_R\rangle\}$ and $\{|i_A\rangle\}$ be orthonormal bases for the spaces. Define a vector $\alpha \in \mathcal{H}_R \otimes \mathcal{H}_A$ by
$$|\alpha\rangle = \sum_i |i_R\rangle \otimes |i_A\rangle,$$
and put $\sigma = (I_R \otimes \Phi)(|\alpha\rangle\langle\alpha|)$, where $I_R$ is the identity map on $B(\mathcal{H}_R)$. Note that $\sigma \geq 0$ since $\Phi$ is CP. In particular we can diagonalize it and write $\sigma = \sum_j |s_j\rangle\langle s_j|$ for some (not necessarily unit length) vectors $s_j \in \mathcal{H}_R \otimes \mathcal{H}_B$. Define linear operators $E_i\colon \mathcal{H}_A \to \mathcal{H}_B$ by $E_i |j_A\rangle = \langle j_R | s_i\rangle$. Let us check that these operators fit the bill in (c). We have
$$\begin{aligned}
\sum_i E_i |k_A\rangle\langle l_A| E_i^* &= \sum_i \langle k_R | s_i\rangle\langle s_i | l_R\rangle = \langle k_R|\, \sigma\, |l_R\rangle = \langle k_R|\, (I \otimes \Phi)(|\alpha\rangle\langle\alpha|)\, |l_R\rangle \\
&= \sum_{i,j} \langle k_R|\, \big(|i_R\rangle\langle j_R| \otimes \Phi(|i_A\rangle\langle j_A|)\big)\, |l_R\rangle = \Phi(|k_A\rangle\langle l_A|),
\end{aligned}$$
for any $k,l$. By linearity it follows that $\Phi(\tau) = \sum_i E_i \tau E_i^*$ for any $\tau \in B(\mathcal{H}_A)$. Finally we need to show that $\sum_i E_i^* E_i = I$. Since $\Phi$ is trace preserving, we must have
$$1 = \mathrm{tr}(\Phi(\tau)) = \mathrm{tr}\big(\sum_i E_i \tau E_i^*\big) = \mathrm{tr}\big(\sum_i E_i^* E_i\, \tau\big), \qquad (22)$$
for all density matrices $\tau \in D(\mathcal{H}_A)$. The operator $L = \sum_i E_i^* E_i \in B(\mathcal{H}_A)$ is positive, so we can diagonalize it. Let $\Lambda = \mathrm{diag}(\lambda_1,\ldots,\lambda_a)$ be the diagonal matrix representing $L$ w.r.t. the basis of eigenvectors, with $a = \dim \mathcal{H}_A$. By (22) it follows that $\mathrm{tr}(\Lambda x) = 1$ for any density matrix $x \in B(\mathbb{C}^a)$. In particular for $x = \mathrm{diag}(x_1,\ldots,x_a)$ we get that $\sum_i \lambda_i x_i = 1$, with $\lambda_i, x_i \geq 0$ and $\sum_i x_i = 1$. It follows that $\Lambda = I_a$ and hence $\sum_i E_i^* E_i = I$, where these are the unit matrix and the identity operator, respectively. This proves (c).

Showing (c) $\Rightarrow$ (a) is straightforward. To see that a map $\Phi$ of the form (c) is trace preserving is essentially the calculation in equation (22). Complete positivity follows by noting that $(I \otimes \Phi)(\sigma) = \sum_i (I \otimes E_i)\, \sigma\, (I \otimes E_i)^*$, which is easily seen to be a positive map.
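The construction used in the proof of (a) $\Rightarrow$ (c) can be carried out numerically: form the matrix $\sigma = (I \otimes \Phi)(|\alpha\rangle\langle\alpha|)$ and read off operator elements from its spectral decomposition. The sketch below is my own illustration of that recipe (the qubit dephasing channel and the parameter $p = 0.25$ are arbitrary choices):

```python
import numpy as np

def choi_matrix(channel, d_in, d_out):
    """sigma = sum_{i,j} |i><j| (x) Phi(|i><j|), i.e. (I (x) Phi) applied to |alpha><alpha|."""
    C = np.zeros((d_in * d_out, d_in * d_out), dtype=complex)
    for i in range(d_in):
        for j in range(d_in):
            Eij = np.zeros((d_in, d_in), dtype=complex)
            Eij[i, j] = 1.0
            C += np.kron(Eij, channel(Eij))
    return C

def kraus_from_choi(C, d_in, d_out, tol=1e-12):
    """Operator elements obtained from the eigenvectors of the (positive) matrix sigma."""
    evals, evecs = np.linalg.eigh(C)
    return [np.sqrt(lam) * v.reshape(d_in, d_out).T
            for lam, v in zip(evals, evecs.T) if lam > tol]

# Illustrative channel: qubit dephasing, Phi(rho) = (1 - p) rho + p Z rho Z.
p = 0.25
Z = np.diag([1.0, -1.0])
phi = lambda rho: (1 - p) * rho + p * Z @ rho @ Z

K = kraus_from_choi(choi_matrix(phi, 2, 2), 2, 2)
print(sum(E.conj().T @ E for E in K))         # the identity: sum_i E_i^* E_i = I
rho = np.array([[0.5, 0.5], [0.5, 0.5]])
print(sum(E @ rho @ E.conj().T for E in K))   # agrees with phi(rho)
```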

(c) $\Rightarrow$ (b). Let $\Phi\colon B(\mathcal{H}_A) \to B(\mathcal{H}_B)$ be of the form $\Phi(\tau) = \sum_i E_i \tau E_i^*$, with $I = \sum_i E_i^* E_i$. Put $a = \dim(\mathcal{H}_A)$ and $b = \dim(\mathcal{H}_B)$. Then, as we just showed, (a) also holds. By the construction done in the proof of (a) $\Rightarrow$ (c) it is seen that we can represent $\Phi$ using exactly $a$ operator elements (this observation is the reason why we did the seemingly superfluous part (c) $\Rightarrow$ (a)). Let $\mathcal{H}_E$ be a Hilbert space of the same dimension as $\mathcal{H}_B$ and let $\mathcal{H}_F$ be a Hilbert space of the same dimension as $\mathcal{H}_A$. Let $\{|i_F\rangle\}$ be a basis of $\mathcal{H}_F$, where $i = 1,\ldots,a$. The task is to construct a unitary operator $U\colon \mathcal{H}_A \otimes \mathcal{H}_E \to \mathcal{H}_B \otimes \mathcal{H}_F$ such that
$$\Phi(\tau) = \mathrm{tr}_F(U(\tau \otimes \varrho_E)U^*), \qquad (23)$$
where $\varrho_E$ is a density matrix in $B(\mathcal{H}_E)$. Let $\{|j_E\rangle\}$ be a basis of $\mathcal{H}_E$, and put $\varrho_E = |1_E\rangle\langle 1_E|$. We can rewrite (23) as
$$\Phi(\tau) = \sum_i \big(\langle i_F| U |1_E\rangle\big)\, \tau\, \big(\langle 1_E| U^* |i_F\rangle\big).$$
We are done if we can construct $U$ such that $E_i = \langle i_F| U |1_E\rangle$ for all $i$. To understand this requirement, consider the matrix representation of $U$ w.r.t. the basis $\{|j_A\rangle \otimes |k_E\rangle\}$ for the domain and the basis $\{|l_B\rangle \otimes |m_F\rangle\}$ for the range (Footnote 3). The requirement then translates into $U$ being of the form
$$U = \begin{pmatrix} [E_1] & * & \cdots & * \\ [E_2] & * & \cdots & * \\ \vdots & \vdots & & \vdots \\ [E_a] & * & \cdots & * \end{pmatrix},$$
where $[E_i]$ is the matrix representation of $E_i$ w.r.t. the basis $\{|j_A\rangle\}$ for the domain and $\{|l_B\rangle\}$ for the range, and the asterisks denote columns that are so far unspecified. Note that only the first block column is determined by the requirement. Furthermore, the assumption that $\sum_i E_i^* E_i = I$ says that the first $a$ columns of $U$ consist of mutually orthogonal unit vectors. From basic linear algebra, we know that it is possible to choose the remaining columns of $U$ so that $U$ becomes a unitary matrix. We have thus shown that we can write $\Phi$ in the form (23).

(b) $\Rightarrow$ (a). Suppose $\Phi$ is a Stinespring map. Clearly $\Phi$ is linear. We have $\mathrm{tr}\,\Phi(\tau) = \mathrm{tr}(U(\tau \otimes \varrho_E)U^*) = \mathrm{tr}(\tau \otimes \varrho_E) = \mathrm{tr}(\tau)\,\mathrm{tr}(\varrho_E) = \mathrm{tr}(\tau)$, so $\Phi$ is trace preserving. Note that
$$(I \otimes \Phi)(x) = \mathrm{tr}_F\big((I \otimes U)(x \otimes \varrho_E)(I \otimes U)^*\big), \qquad (24)$$
for any $x \in B(V \otimes \mathcal{H}_A)$. Here $I$ denotes the identity operator on the Hilbert space $V$. The formula (24) is easily seen to hold for $x$ of the form $x = y \otimes z$, with $y \in B(V)$ and $z \in B(\mathcal{H}_A)$. Both sides of (24) are linear in $x$, so the equation must hold for all $x \in B(V \otimes \mathcal{H}_A)$ as well. For $x \geq 0$ we have $x \otimes \varrho_E \geq 0$ and hence $(I \otimes U)(x \otimes \varrho_E)(I \otimes U)^* \geq 0$ by unitary invariance. It follows that $(I \otimes \Phi)(x) \geq 0$. Hence $\Phi$ is a quantum channel.

Footnote 3: The basis elements are ordered lexicographically, that is, according to the convention (1,1), (1,2), ..., (2,1), (2,2), etc.
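Conversely, the equivalence of (b) and (c) is easy to check numerically in a small example: pick a unitary $U$ on $\mathcal{H}_A \otimes \mathcal{H}_E$ and a pure environment state, define $\Phi$ by (21), and compare with the operator-sum form. The sketch below is my own illustration (a random $4 \times 4$ unitary and a two-dimensional environment are arbitrary choices; the first basis vector plays the role of $|1_E\rangle$):

```python
import numpy as np

d = 2                                             # dim of the system; the environment is given the same dimension here

def partial_trace_env(M, d_sys, d_env):
    """Trace out the second (environment) factor of an operator on H_sys (x) H_env."""
    M = M.reshape(d_sys, d_env, d_sys, d_env)
    return np.trace(M, axis1=1, axis2=3)

rng = np.random.default_rng(0)
A = rng.normal(size=(d * d, d * d)) + 1j * rng.normal(size=(d * d, d * d))
U, _ = np.linalg.qr(A)                            # a random unitary on H_A (x) H_E

ket1E = np.zeros((d, 1)); ket1E[0, 0] = 1.0
rho_E = ket1E @ ket1E.conj().T                    # pure environment state |1_E><1_E|

def stinespring_channel(rho):
    """Phi(rho) = tr_E( U (rho (x) rho_E) U^* ), as in (21)."""
    joint = U @ np.kron(rho, rho_E) @ U.conj().T
    return partial_trace_env(joint, d, d)

# The corresponding operator elements E_k = (I (x) <k|) U (I (x) |1_E>).
kraus = [U.reshape(d, d, d, d)[:, k, :, 0] for k in range(d)]

rho = np.array([[0.7, 0.2], [0.2, 0.3]], dtype=complex)
out_stinespring = stinespring_channel(rho)
out_kraus = sum(E @ rho @ E.conj().T for E in kraus)
print(np.allclose(out_stinespring, out_kraus))    # True: the two representations agree
print(np.isclose(np.trace(out_stinespring), 1.0)) # True: Phi is trace preserving
```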

3 Capacity

In Chapter 1 we defined the capacity of a classical channel $\Psi$ as
$$C(\Psi) = \max_{\pi \in P(X)} \Big\{ H(\Psi\pi) - \sum_x \pi(x) H(\Psi\delta_x) \Big\}, \qquad (25)$$
and the Holevo capacity of a quantum channel $\Phi$ as
$$\chi(\Phi) = \sup_E \Big[ S\big(\sum_i p_i \Phi(\varrho_i)\big) - \sum_j p_j S(\Phi(\varrho_j)) \Big]. \qquad (26)$$
So far these are merely numbers associated with channels, and it is far from clear why they deserve to be named capacities. In this chapter we start from scratch and set up a reasonable definition of capacity in terms of information transmission. We begin with the classical case. The main theorem for classical channels is Shannon's Noisy Channel Coding Theorem, which states that the capacity of a classical (noisy) channel is given by the formula (25). There are different proofs of this theorem in the literature. The proof we will present uses the idea of random coding by typical sequences and is taken from [2]. The random coding techniques will be reused later in the proof of the HSW Theorem, Theorem 4.7, at the end of this chapter. After considering capacities of classical channels we move on to quantum channels. The theory will be developed in parallel with the classical counterpart, but with emphasis on points where quantum theory is different. Based on the notion of entanglement we introduce four different capacities. We then move on to derive a formula for each of these capacities, similar to Shannon's formula. One of these four capacities is the Holevo capacity as given in (26), and this is the content of the aforementioned HSW Theorem.

3.1 Notation

Until now we have expressed quantities such as entropy and mutual information in terms of probability distributions. Equivalent formulations in terms of random variables sometimes simplify notation. We shall consider random variables in many places in the coming chapters, so let us take this opportunity to refresh basic definitions. A general random variable is a measurable function $X\colon \Omega \to S$, where $(\Omega, \mathcal{E}, p)$ is a probability space, with probability measure $p$ and $\sigma$-algebra $\mathcal{E}$. Furthermore $(S, \mathcal{F})$ is a measurable space, with $\sigma$-algebra $\mathcal{F}$. The push-forward measure $X_*(p)$ on $S$, induced by $X$, is called the distribution of $X$. It is defined by
$$X_*(p)(A) = p(X^{-1}(A)), \quad \text{for } A \in \mathcal{F}. \qquad (27)$$
For a measure $\alpha$ on $(S, \mathcal{F})$ and a non-negative function $f$ integrable w.r.t. $\alpha$, we define a measure $f \cdot \alpha$ on $(S, \mathcal{F})$ given by
$$f \cdot \alpha(A) = \int_A f\, d\alpha \quad \text{for } A \in \mathcal{F}. \qquad (28)$$

In some situations the sample space $(\Omega, \mathcal{E}, p)$ plays no role in the arguments and only the distribution matters. Then it is customary to "define" the random variable by specifying the image space $(S, \mathcal{F})$ together with the distribution. If $S = \mathbb{R}^n$ or $\mathbb{C}^n$, $\mathcal{F}$ is by default the Borel algebra. If $S$ is discrete, $\mathcal{F}$ is by default the family of all subsets. A probability density function (pdf) for the random variable $X$ on $S$ is a non-negative measurable function $f\colon S \to \mathbb{R}$ such that
$$P(X \in V) = p(\{s \in \Omega\colon X(s) \in V\}) = \int_V f\, d(X_*p), \qquad (29)$$
for any $V \in \mathcal{F}$. Here $P$ is the generic symbol for "probability of". The integration is with respect to the distribution.

Discrete Random Variables

Suppose $X$ is a discrete random variable with distribution $p$ on a set $X$. Then $p$ is uniquely given by the values $p_X(x) := p(\{x\})$, for $x \in X$, and with a slight misuse of words it is customary to refer to the function $p_X$ as the distribution of $X$. For two discrete random variables $X$ and $Y$ on the finite sets $X$ and $Y$, we denote by $(X,Y)$ a random variable on $X \times Y$ with marginal random variables $X$ and $Y$. This means that $p_X(x) = \sum_{y \in Y} p(x,y)$ and $p_Y(y) = \sum_{x \in X} p(x,y)$ are the distributions of $X$ and $Y$, respectively. The notation $(X,Y)$ does not uniquely specify the random variables, as two different random variables can have the same marginals. Let us introduce some entropy quantities we will need in the following. Two of these, the Shannon entropy and the mutual information, we already encountered in Chapter 1.

Definition 3.1 (Entropy) Let $(X,Y)$ be a random variable on a set $X \times Y$ with probability distribution $p$. Here $X$ and $Y$ are the marginal random variables on $X$ and $Y$, respectively. We define the following quantities.
Entropy: $H(X) = -\sum_x p_X(x) \log p_X(x)$ (30)
Joint Entropy: $H(X,Y) = -\sum_{x,y} p(x,y) \log p(x,y)$ (31)
Conditional Entropy: $H(X|Y) = H(X,Y) - H(Y) = \sum_y p_Y(y) H(X|Y=y)$ (32)
Mutual Information: $H(X : Y) = H(X) + H(Y) - H(X,Y)$ (33)

The entropy $H(X)$ of a random variable $X$ quantifies the amount of uncertainty about the value of $X$ before we learn its value. We will not give a review of general classical information theory here. The reader interested in learning more about entropy is encouraged to consult for example [2]. Concavity of the logarithm implies the following result.

Proposition 3.2 (Subadditivity of Shannon Entropy) Let $X$ and $Y$ be discrete random variables with joint random variable $(X,Y)$. The Shannon entropy is subadditive, meaning that
$$H(X,Y) \leq H(X) + H(Y),$$
and equality holds if and only if $X$ and $Y$ are independent.

Proof. We can w.l.o.g. assume that $p_X$ and $p_Y$ are nonzero (if not, we can obtain new random variables $X'$, $Y'$ and $(X',Y')$ with $H(X') = H(X)$, $H(Y') = H(Y)$ and $H(X',Y') = H(X,Y)$ by restricting $X$ and $Y$ to the sets on which $p_X$ and $p_Y$ are nonzero, respectively). We have $\log x \cdot \ln 2 = \ln x \leq x - 1$ for all positive $x$, and equality holds if and only if $x = 1$. Let $p$ be the probability distribution of $(X,Y)$. We have
$$\begin{aligned}
H(X) + H(Y) - H(X,Y) &= -\sum_x p_X(x)\log p_X(x) - \sum_y p_Y(y)\log p_Y(y) + \sum_{x,y} p(x,y)\log p(x,y) \\
&= \sum_{x,y} p(x,y)\log \frac{p(x,y)}{p_X(x)p_Y(y)} \\
&\geq \frac{1}{\ln 2} \sum_{x,y} p(x,y)\Big(1 - \frac{p_X(x)p_Y(y)}{p(x,y)}\Big) \\
&= \frac{1}{\ln 2} \sum_{x,y} \big(p(x,y) - p_X(x)p_Y(y)\big) = 0.
\end{aligned}$$
Equality holds in the third line if and only if $p_X(x)p_Y(y) = p(x,y)$ for all $x$ and $y$. That is, if and only if $X$ and $Y$ are independent.

Random Variables Induced by Channels

Let $\Phi\colon X \times Y \to [0,1]$ be a classical channel, and suppose we are given a random variable $X$ on $X$. The channel $\Phi$ induces a random variable on $X \times Y$, and consequently also a marginal random variable on $Y$, which we denote by $\Phi X$. In this particular case the joint random variable $(X, \Phi X)$ is defined to have the distribution $p(x,y) = \Phi(x,y)p_X(x)$, where $p_X$ is the distribution of $X$. Thus the distribution of $\Phi X$ is given by $p_{\Phi X}(y) = \sum_x \Phi(x,y)p_X(x)$.

A map $F\colon X \to Y$, where $X$ and $Y$ are finite sets, can be viewed as a noiseless classical channel with input $X$ and output $Y$. That is, it corresponds to the channel $(x,y) \mapsto \delta_{F(x)}(y)$, where $\delta$ is the Kronecker delta. When there is no risk of misunderstanding we will denote the corresponding noiseless classical channel by the same letter.

Composition of channels

Two channels $\Phi_1\colon X \times Y \to [0,1]$ and $\Phi_2\colon Y \times Z \to [0,1]$ can be composed (Footnote 4) into a channel $\Phi_2 \diamond \Phi_1\colon X \times Z \to [0,1]$, given by
$$\Phi_2 \diamond \Phi_1(x,z) = \sum_y \Phi_1(x,y)\Phi_2(y,z). \qquad (34)$$
It is easy to verify that $\diamond$ is associative. As this is not ordinary function composition, we have used the symbol $\diamond$ rather than $\circ$ to represent it. The definition is based on the intuitive idea of composing channels in which the output of the first channel is the input of the second, as displayed in Figure 4.

Footnote 4: The composition notation used here is not standard in the literature.
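Using the same matrix convention as before (rows indexed by inputs, columns by outputs), the composition (34) is just a matrix product, with the order reversed relative to the $\diamond$ notation. A minimal sketch with arbitrarily chosen small channels (my own illustration, not an example from the text):

```python
import numpy as np

def compose(Phi2, Phi1):
    """(Phi2 <> Phi1)(x, z) = sum_y Phi1(x, y) Phi2(y, z), cf. (34); here <> stands for the composition."""
    return Phi1 @ Phi2

# Illustrative channels: X -> Y and Y -> Z (arbitrary stochastic matrices).
Phi1 = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
Phi2 = np.array([[0.7, 0.2, 0.1],
                 [0.0, 0.5, 0.5]])

K = compose(Phi2, Phi1)
print(np.allclose(K.sum(axis=1), 1.0))      # the composition is again a stochastic map

# Associativity: composing with a third channel Z -> Z gives the same result either way.
Phi3 = np.array([[0.6, 0.3, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.2, 0.2, 0.6]])
print(np.allclose(compose(Phi3, compose(Phi2, Phi1)),
                  compose(compose(Phi3, Phi2), Phi1)))   # True
```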

Note that $\Phi_2 \diamond \Phi_1(x,z)$ is the probability of getting $z$ out, given that $x$ was sent. To follow the convention from ordinary function composition, the rightmost channel in the composition $\Phi_2 \diamond \Phi_1$ is the channel applied first.

[Figure 4: The composition of the channels $\Phi_1$ and $\Phi_2$.]

3.2 Capacity of a classical channel

Suppose Alice and Bob can communicate via a classical channel, and that Alice wants to send information to Bob. What is the maximal number of bits per use of the channel that Alice can send? Let us make the setup precise. Suppose we are given a channel $\Phi\colon X \times Y \to [0,1]$ with probability matrix $\Phi(x,y) = P(y|x)$, for $x \in X$ and $y \in Y$. Alice has a list of possible messages $\{1,\ldots,2^m\}$, where $m \in \mathbb{N}$. For the sake of simplicity, we assume the number of messages to be a power of 2. Alice performs an encoding described by a map $C_m\colon \{1,\ldots,2^m\} \to X$. The message is then sent through the channel $\Phi$ to Bob, who performs a decoding, described by a map $D_m\colon Y \to \{1,\ldots,2^m\}$. We call $m$ the (bit) length of the coding and decoding, respectively. There is no guarantee that the message sent by Alice is identical to the one Bob reads after decoding. This will in general depend on the encoding, the channel and the decoding.

Consider first a channel $K\colon Z \times Z \to [0,1]$, with the same input and output alphabet. For a channel of this form, define the average error $\delta$ by
$$\delta(K) = \frac{1}{n} \sum_{i \in Z} [1 - K(i,i)], \qquad (35)$$
where $n = |Z|$ is the number of elements in the alphabet. Note that $\delta(K)$ is the probability that Bob does not receive the same letter that Alice sent, given that each letter is sent equally often through the channel. For an arbitrary channel $\Phi$ as above and a coding $C$ and decoding $D$ of bit length $m$, consider the channel $K = D \diamond \Phi \diamond C$. This is a channel with identical input and output alphabets. Note that $\delta(K)$ is the probability that Bob does not read out the same letter that Alice sent, given that all messages are sent equally often by Alice. Define the error $\delta_m$ of $\Phi$ by
$$\delta_m(\Phi) = \min_{C,D} \delta(D \diamond \Phi \diamond C), \qquad (36)$$
where the minimum is over all encodings $C$ and decodings $D$ of bit length $m$. The minimum makes sense since there is a finite number of possible encodings and decodings. An encoding/decoding scheme attaining the minimum is called optimal. Note that $\delta_m(\Phi)$ is the smallest possible error you can get when you are required to send $2^m$ messages.
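As a concrete illustration of the average error (35), one can build the channel $K = D \diamond \Phi^{\otimes 3} \diamond C$ for a specific (not necessarily optimal) coding scheme and evaluate $\delta(K)$ directly. The sketch below is my own example, using a 3-fold repetition code with majority-vote decoding over a binary symmetric channel; the channel and code are arbitrary illustrative choices:

```python
import numpy as np

p = 0.1
Phi = np.array([[1 - p, p],
                [p, 1 - p]])
Phi3 = np.kron(np.kron(Phi, Phi), Phi)   # three independent uses of the channel

# Encoding C: message bit -> codeword (repetition), as a deterministic (noiseless) channel.
C = np.zeros((2, 8))
C[0, 0b000] = 1.0
C[1, 0b111] = 1.0

# Decoding D: received word -> majority vote.
D = np.zeros((8, 2))
for word in range(8):
    bits = [(word >> k) & 1 for k in range(3)]
    D[word, int(sum(bits) >= 2)] = 1.0

K = C @ Phi3 @ D                         # the channel D <> Phi^3 <> C as a stochastic matrix
delta = np.mean(1.0 - np.diag(K))        # the average error delta(K) of (35)
print(delta)                             # approx 3 p^2 - 2 p^3 = 0.028
```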

It is not hard to show that $\delta_m(\Phi) \to 1$ as $m \to \infty$, and clearly $\delta_0(\Phi) = 0$ (only one letter in the alphabet). An encoding/decoding scheme of length $m$ for $\Phi$ is called perfect if $\delta_m(\Phi) = 0$. If we apply the channel multiple times we can of course send more messages with small average error. The relationship between the number of times $n$ we apply the channel and the alphabet bit size $m$ is crucial. The fraction $m/n$ is the bit rate of the channel when applied $n$ times. In Figure 5 a coding protocol for $n$ uses of the channel $\Phi$ is illustrated.

[Figure 5 (Classical Coding/Decoding Protocol): A message is first encoded using the map $C_n$, then sent through the channel $\Phi^{\otimes n}$, and finally decoded by the map $D_n$. The composition of the encoding, the channel $\Phi^{\otimes n}$ and the decoding is equivalent to a classical channel $K_n = D_n \diamond \Phi^{\otimes n} \diamond C_n$.]

We are interested in protocols with arbitrarily low average error. This leads to the following definition.

Definition 3.3 (Reliable rate) Let $R \geq 0$. Then $R$ is called a reliable rate if there exists a strictly increasing sequence $(a_k)_k$ in $\mathbb{N}$ such that
$$\delta_{\lceil a_k R \rceil}(\Phi^{\otimes a_k}) \to 0. \qquad (37)$$
Here $\lceil \cdot \rceil$ is the "ceiling" function, which for a given $x \in \mathbb{R}$ returns the unique integer $k$ such that $k - 1 < x \leq k$.

In other words, a rate $R \geq 0$ is reliable if we can get an arbitrarily small error by applying the channel enough times, while still sending $\lceil nR \rceil / n \approx R$ bits per use of the channel when $n$ is large. The capacity is the ultimate bit rate with which we can reliably send messages through the channel. This is the content of the following definition.

Definition 3.4 (Capacity of a classical channel) For a channel $\Phi$, the capacity $C(\Phi)$ is defined as $C(\Phi) = \sup\{R\colon R \text{ reliable}\}$.

From this definition it is not at all clear how to calculate the capacity of a specific channel. The following theorem, by Claude Shannon, is one of the main theorems of classical information theory. The theorem provides a formula for the capacity from which it can be calculated - at least numerically.

Theorem 3.5 (Shannon's Noisy Channel Coding Theorem) For a classical channel $\Phi\colon X \times Y \to [0,1]$, the capacity $C(\Phi)$ is given by
$$C(\Phi) = \sup_X H(X : \Phi X), \qquad (38)$$
where the supremum is over random variables $X$ on $X$ and $\Phi X$ is the induced random variable on $Y$.
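Shannon's formula (38) can be evaluated numerically. A standard tool for this (from the information theory literature, not discussed in the thesis) is the Blahut-Arimoto algorithm, which iteratively maximizes the mutual information over input distributions. A minimal sketch, with the binary symmetric channel as an arbitrary test case:

```python
import numpy as np

def blahut_arimoto(W, iters=200):
    """Numerically maximize H(X : Phi X) over input distributions, for W[x, y] = P(y|x)."""
    n_in = W.shape[0]
    p = np.full(n_in, 1.0 / n_in)
    for _ in range(iters):
        q = p @ W                                    # output distribution
        with np.errstate(divide="ignore", invalid="ignore"):
            log_ratio = np.where(W > 0, np.log2(W / q), 0.0)
        D = np.sum(W * log_ratio, axis=1)            # relative entropy D(W(.|x) || q)
        c = 2.0 ** D
        p = p * c / np.sum(p * c)
    return float(np.log2(np.sum(p * c))), p          # capacity estimate (bits), optimizing input

W = np.array([[0.9, 0.1],
              [0.1, 0.9]])
cap, p_opt = blahut_arimoto(W)
print(cap)     # approx 1 - H({0.9, 0.1}) = 0.531 bits
print(p_opt)   # the uniform input is optimal for this symmetric channel
```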

The proof is rather long, so we will separate it into two parts according to the inequalities "$\leq$" and "$\geq$". For "$\leq$" we need the following two lemmas.

Lemma 3.6 (The Fano Inequality) Let $U$ be a random variable defined on $\{u_1,\ldots,u_m\}$ and let $V$ be a random variable defined on $\{v_1,\ldots,v_m\}$ such that their joint probability distribution $p(u,v)$ satisfies
$$\sum_{i=1}^m p(u_i, v_i) = 1 - \delta, \qquad \delta > 0.$$
Then $H(U|V) \leq 1 + \delta \log m$.

Proof. We have
$$H(U|V) = \sum_{v_j} p_V(v_j) H(U|V = v_j). \qquad (39)$$
Now fix $j$ and put $s_i = p(u_i | v_j)$. Then
$$H(U|V = v_j) = -\sum_i s_i \log s_i = H(\{s_j, 1 - s_j\}) + (1 - s_j)\, H\Big(\Big\{\frac{s_1}{1 - s_j},\ldots,\frac{\hat{s}_j}{1 - s_j},\ldots,\frac{s_m}{1 - s_j}\Big\}\Big) \leq 1 + (1 - s_j)\log m.$$
Here the hat denotes that the element is omitted from the list. Thus (39) gives
$$H(U|V) \leq \sum_{v_j} p_V(v_j)\big(1 + (1 - s_j)\log m\big) = 1 + \Big(1 - \sum_{v_j} p_V(v_j)s_j\Big)\log m = 1 + \Big(1 - \sum_j p(u_j, v_j)\Big)\log m = 1 + \delta\log m.$$

Corollary 3.7 Let $K\colon Z \times Z \to [0,1]$ be a classical channel with the same input and output alphabet and with $|Z| = 2^m$ for $m \in \mathbb{N}$. Then
$$m \leq 1 + \delta(K)m + \sup_X H(X : KX), \qquad (40)$$
where the supremum is over all random variables $X$ on $Z$.

Proof. Let $U$ be the equidistributed random variable on $Z$ and put $V = KU$. Then
$$m = H(U) = H(U|V) + H(U : V) \quad \text{(see Def. 3.1)}. \qquad (41)$$
The distribution of $(U,V)$ is given by $p(u,v) = K(u,v)\frac{1}{2^m}$. Thus
$$\delta(K) = \frac{1}{2^m}\sum_{i \in Z} [1 - K(i,i)] = 1 - \sum_{i \in Z} p(i,i).$$
By Lemma 3.6, $H(U|V) \leq 1 + \delta(K)m$, and the corollary follows.

A sequence of random variables $X_1 \to X_2 \to X_3 \to \cdots$ is said to form a Markov chain if $X_{n+1}$ is independent of $X_1,\ldots,X_{n-1}$, given $X_n$. Formally this means that
$$p(X_{n+1} = x_{n+1} \mid X_n = x_n,\ldots,X_1 = x_1) = p(X_{n+1} = x_{n+1} \mid X_n = x_n),$$
where $x_i \in X_i$ and $X_i$ is a random variable on $X_i$. Given a composition of classical channels and a random variable on the input space of the first channel, the random variables induced by the channels will form a Markov chain. This is our primary motivation for studying Markov chains. Mutual information can only decrease along a Markov chain, as the following lemma states.

Lemma 3.8 (Data Processing Inequality) Let $X, Y, Z$ be random variables such that $X \to Y \to Z$ forms a Markov chain; that is,
$$p(z|y) = p(z|x,y), \qquad (42)$$
where $x$, $y$ and $z$ are shorthand for $X = x$, $Y = y$ and $Z = z$. Then
$$H(X) \geq H(X : Y) \geq H(X : Z).$$

Proof. From the definition of Shannon entropy it is seen that $H(X : Y) \geq H(X : Z)$ is equivalent to $H(X|Y) \leq H(X|Z)$. Unraveling definitions it is easily seen that (42) is equivalent to $p(x|y) = p(x|y,z)$, saying that $Z \to Y \to X$ is a Markov chain. From this we immediately get $H(X|Y) = H(X|Y,Z)$. Now using the definition of conditional entropy we have
$$H(X|Y,Z) = H(X,Y|Z) - H(Y|Z) \leq H(X|Z).$$
By the last equation in the definition of conditional entropy, Definition 3.1, it suffices to show $H(X,Y|Z = z) - H(Y|Z = z) \leq H(X|Z = z)$ for any $z$. However, this follows from subadditivity of the Shannon entropy. This proves the second inequality. The first inequality follows by applying what we have just shown to the Markov chain $X \to X \to Y$, using that $H(X) = H(X : X)$.

We are now ready to prove the "$\leq$" part of Shannon's Noisy Channel Coding Theorem.

Theorem 3.9 For a classical channel $\Phi\colon X \times Y \to [0,1]$, the capacity $C(\Phi)$ satisfies
$$C(\Phi) \leq \sup_X H(X : \Phi X), \qquad (43)$$
where the supremum is over all random variables $X$ on $X$.

Proof. Suppose $R$ is a reliable rate and fix $n \in \mathbb{N}$. Let $C$ and $D$ be optimal encodings and decodings for $\Phi^{\otimes n}$, respectively, of bit length $m$. Consider the classical channel $K = D \diamond \Phi^{\otimes n} \diamond C\colon Z \times Z \to [0,1]$ induced by this encoding/decoding scheme. Here $Z = \{1,\ldots,2^m\}$. Optimality implies that $\delta(K) = \delta_m(\Phi^{\otimes n})$. From Corollary 3.7 we get that
$$m \leq 1 + \delta_m(\Phi^{\otimes n})m + \sup_U H(U : KU), \qquad (44)$$
where the supremum is over all random variables $U$ on $Z$. Let now $U$ be a fixed random variable on $Z$. Consider the Markov chain of random variables
$$U \xrightarrow{\ C\ } S \xrightarrow{\ \Phi^{\otimes n}\ } T \xrightarrow{\ D\ } V, \qquad (45)$$
where by definition $S = CU$, $T = \Phi^{\otimes n} S$ and $V = DT = KU$. By the Data Processing Inequality, we get $H(U : V) \leq H(U : T) \leq H(S : T)$. For the last term in (44) we thus get
$$\sup_U H(U : KU) \leq \sup_S H(S : \Phi^{\otimes n} S), \qquad (46)$$
where the supremum on the left-hand side is over all random variables $U$ on $Z$ and the supremum on the right-hand side is over all random variables $S$ on $X^n$. According to Proposition 1.3 of Section 1.1, the right-hand side is additive in $\Phi$, meaning that
$$\sup_S H(S : \Phi^{\otimes n} S) = n \sup_X H(X : \Phi X), \qquad (47)$$
where the supremum on the right-hand side is over all random variables $X$ on $X$. Since $R$ is a reliable rate, there exists a strictly increasing sequence $(a_k)_k$ such that
$$\delta_{\lceil a_k R \rceil}(\Phi^{\otimes a_k}) \to 0, \qquad (48)$$
as $k \to \infty$. Setting $m = \lceil a_k R \rceil$ and $n = a_k$ in (44) and dividing by $a_k$ we get
$$R \leq \frac{\lceil a_k R \rceil}{a_k} \leq \frac{1}{a_k} + \delta_{\lceil a_k R \rceil}(\Phi^{\otimes a_k})\,\frac{\lceil a_k R \rceil}{a_k} + \sup_X H(X : \Phi X) \longrightarrow \sup_X H(X : \Phi X), \qquad (49)$$
as $k \to \infty$. This proves the claim.

Typicality

In order to prove the "$\geq$" part of Shannon's Noisy Channel Coding Theorem we first need to establish some properties of randomly generated sequences. These sequences will later be used as codewords in the proof of Shannon's theorem.

Consider an alphabet $X = \{1,\ldots,m\}$ and a random variable $X$ on $X$ with probability distribution $p_1,\ldots,p_m$. Consider also the random variable $X^{(n)} = (X_1,\ldots,X_n)$ on $X^n$, where the $X_i = X$ are independent. Now suppose we choose a codeword $x^{(n)} = (x_1,\ldots,x_n) \in X^n$ according to the distribution of $X^{(n)}$. That is the same as choosing each letter $x_i$ randomly and independently according to the probability distribution $p_1,\ldots,p_m$. If $n$ is large, we would expect a sequence $x^{(n)}$ to contain approximately $p_1 n$ 1's, $p_2 n$ 2's, ... and $p_m n$ $m$'s. Suppose for simplicity that $p_1 n, p_2 n,\ldots,p_m n \in \mathbb{N}$ and that $x^{(n)}$ consists of exactly this number of 1's, 2's, etc. Let us give a rough estimate of the probability $p(x^{(n)}) = P(X^{(n)} = x^{(n)})$, that is, the probability of choosing exactly the sequence $x^{(n)}$. From elementary combinatorics, the number of sequences with these letter counts is
$$\frac{n!}{(p_1 n)!(p_2 n)!\cdots(p_m n)!}. \qquad (50)$$
Stirling's approximation gives $\log(n!) \approx n\log n$ when $n$ is large, so the logarithm of this number is approximately
$$n\log n - p_1 n\log(p_1 n) - \cdots - p_m n\log(p_m n) = nH(X).$$
All sequences with these letter counts have the same probability, and together they carry essentially all of the probability mass, so each of them has probability $p(x^{(n)}) \approx 2^{-nH(X)}$, or equivalently
$$-\frac{1}{n}\log p(x^{(n)}) - H(X) \approx 0. \qquad (51)$$
Any sequence satisfying (51) is said to be typical (to be defined rigorously below).

Now let $Y$ be another alphabet and consider a random variable $(X,Y)$ on $X \times Y$, the marginal random variable $X$ on $X$, and the marginal random variable $Y$ on $Y$. As above, each of these random variables gives rise to a notion of typical sequences in the spaces $X^n$, $Y^n$ and $X^n \times Y^n$, respectively. If $x^{(n)} \in X^n$ is a typical sequence, $y^{(n)} \in Y^n$ is a typical sequence, and the combined sequence $(x^{(n)},y^{(n)}) \in X^n \times Y^n$ is typical, then $x^{(n)}$ and $y^{(n)}$ are said to be jointly typical. The precise definition is the following:

Definition 3.10 (Jointly typical sequences) For $\epsilon > 0$ and a random variable $(X,Y)$ on $X \times Y$ with probability distribution $p(x,y)$, we say that the sequences $x^{(n)} \in X^n$ and $y^{(n)} \in Y^n$ are jointly $\epsilon$-typical if the following three properties hold:
$$\Big| -\tfrac{1}{n}\log p(x^{(n)}) - H(X) \Big| < \epsilon, \qquad \Big| -\tfrac{1}{n}\log p(y^{(n)}) - H(Y) \Big| < \epsilon, \qquad \Big| -\tfrac{1}{n}\log p(x^{(n)},y^{(n)}) - H(X,Y) \Big| < \epsilon.$$
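The estimate (51) is easy to check by simulation: draw i.i.d. sequences and compare $-\frac{1}{n}\log p(x^{(n)})$ with $H(X)$. The sketch below is my own illustration (the three-letter source and the parameters $n$, $\epsilon$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# A source on X = {0, 1, 2} with an arbitrary distribution (illustrative choice).
p = np.array([0.5, 0.3, 0.2])
H = float(-np.sum(p * np.log2(p)))

n, trials, eps = 1000, 2000, 0.05
samples = rng.choice(len(p), size=(trials, n), p=p)          # i.i.d. sequences x^(n)
log_prob = np.sum(np.log2(p[samples]), axis=1)               # log p(x^(n)) = sum_i log p(x_i)
typical = np.abs(-log_prob / n - H) < eps                    # the condition of (51) / Definition 3.10
print(H, typical.mean())   # the fraction of eps-typical sequences is close to 1 for large n
```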

When $n$ is large and we randomly choose a pair of sequences $x^{(n)}$ and $y^{(n)}$ according to the distribution of $(X,Y)^{(n)}$, it is very likely that the sequences are jointly typical. If $X$ and $Y$ are correlated (i.e. not independent), then the mutual information is non-zero, i.e. $H(X : Y) > 0$. If we choose sequences $x^{(n)}$ and $y^{(n)}$ independently according to the distributions of $X^{(n)}$ and $Y^{(n)}$, respectively, then it is very unlikely that the sequences are jointly typical. This is the content of the following theorem.

Theorem 3.11 (Asymptotic equipartition property) Let $X$ and $Y$ be finite sets, and let $(X,Y)$ be a random variable on the product set $X \times Y$, where $X$ and $Y$ are the marginal random variables on $X$ and $Y$, respectively. Let $(X,Y)^{(n)}$ denote the random variable on $X^n \times Y^n$ formed out of $n$ independent copies of $(X,Y)$, with marginals $X^{(n)}$ and $Y^{(n)}$. Let $A_\epsilon^{(n)}$ denote the set of jointly typical sequences in $X^n \times Y^n$. Then
1. $P(A_\epsilon^{(n)}) \to 1$ as $n \to \infty$.
2. $|A_\epsilon^{(n)}| \leq 2^{n(H(X,Y)+\epsilon)}$, where $|\cdot|$ denotes the number of elements in the set.
3. Given independent random variables $\tilde{X}^{(n)}, \tilde{Y}^{(n)}$ with the same marginals as $X^{(n)}, Y^{(n)}$, then
$$P\big((\tilde{X}^{(n)}, \tilde{Y}^{(n)}) \in A_\epsilon^{(n)}\big) \leq 2^{-n(H(X:Y)-3\epsilon)}. \qquad (52)$$

The set $A_\epsilon^{(n)}$ is illustrated in Figure 6.

[Figure 6 (Sets of typical sequences): If $(x^{(n)},y^{(n)}) \in A_\epsilon^{(n)} \subset X^n \times Y^n$, that is, $(x^{(n)},y^{(n)})$ is jointly typical, then in particular $x^{(n)}$ is typical and $y^{(n)}$ is typical. There are about $2^{nH(X)}$ typical sequences $x^{(n)}$, $2^{nH(Y)}$ typical sequences $y^{(n)}$ and $2^{nH(X,Y)}$ jointly typical sequences $(x^{(n)},y^{(n)})$.]

The proof of Theorem 3.11 is an easy application of the Weak Law of Large Numbers.

Theorem 3.12 (Weak Law of Large Numbers) Consider a sequence of independent, identically distributed (i.i.d.) random variables $X_i$ with finite mean $E(X_i) = \mu$ and finite variance $E(|X_i - \mu|^2) = \sigma^2$. Then for any $\epsilon > 0$,
$$\lim_{n \to \infty} P\Big( \Big| \frac{1}{n}\sum_{i=1}^n X_i - \mu \Big| \geq \epsilon \Big) = 0. \qquad (53)$$
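Parts 1 and 3 of Theorem 3.11 can likewise be illustrated by simulation: pairs drawn from the joint distribution are almost always jointly typical, while independently drawn pairs essentially never are. The following sketch is my own illustration (a uniform input sent through a binary symmetric channel is an arbitrary choice of correlated pair; $n$, $\epsilon$ and the number of trials are arbitrary as well):

```python
import numpy as np

rng = np.random.default_rng(2)

# Joint distribution of (X, Y): a uniform bit X sent through a binary symmetric channel.
p_joint = np.array([[0.45, 0.05],
                    [0.05, 0.45]])
p_X, p_Y = p_joint.sum(axis=1), p_joint.sum(axis=0)
H_X = float(-np.sum(p_X * np.log2(p_X)))
H_Y = float(-np.sum(p_Y * np.log2(p_Y)))
H_XY = float(-np.sum(p_joint * np.log2(p_joint)))

def jointly_typical(xs, ys, n, eps=0.1):
    """Check the three conditions of Definition 3.10 for arrays of sequences."""
    lx = np.sum(np.log2(p_X[xs]), axis=1)
    ly = np.sum(np.log2(p_Y[ys]), axis=1)
    lxy = np.sum(np.log2(p_joint[xs, ys]), axis=1)
    return (np.abs(-lx / n - H_X) < eps) & (np.abs(-ly / n - H_Y) < eps) \
         & (np.abs(-lxy / n - H_XY) < eps)

n, trials = 1000, 2000
# Correlated pairs drawn from the joint distribution ...
flat = rng.choice(4, size=(trials, n), p=p_joint.ravel())
xs, ys = flat // 2, flat % 2
print(jointly_typical(xs, ys, n).mean())          # close to 1  (part 1 of Theorem 3.11)
# ... versus independently drawn marginals:
xs_ind = rng.choice(2, size=(trials, n), p=p_X)
ys_ind = rng.choice(2, size=(trials, n), p=p_Y)
print(jointly_typical(xs_ind, ys_ind, n).mean())  # essentially 0, cf. the bound (52)
```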


More information

On some special cases of the Entropy Photon-Number Inequality

On some special cases of the Entropy Photon-Number Inequality On some special cases of the Entropy Photon-Number Inequality Smarajit Das, Naresh Sharma and Siddharth Muthukrishnan Tata Institute of Fundamental Research December 5, 2011 One of the aims of information

More information

Network coding for multicast relation to compression and generalization of Slepian-Wolf

Network coding for multicast relation to compression and generalization of Slepian-Wolf Network coding for multicast relation to compression and generalization of Slepian-Wolf 1 Overview Review of Slepian-Wolf Distributed network compression Error exponents Source-channel separation issues

More information

Lecture 6 I. CHANNEL CODING. X n (m) P Y X

Lecture 6 I. CHANNEL CODING. X n (m) P Y X 6- Introduction to Information Theory Lecture 6 Lecturer: Haim Permuter Scribe: Yoav Eisenberg and Yakov Miron I. CHANNEL CODING We consider the following channel coding problem: m = {,2,..,2 nr} Encoder

More information

Compression and entanglement, entanglement transformations

Compression and entanglement, entanglement transformations PHYSICS 491: Symmetry and Quantum Information April 27, 2017 Compression and entanglement, entanglement transformations Lecture 8 Michael Walter, Stanford University These lecture notes are not proof-read

More information

x log x, which is strictly convex, and use Jensen s Inequality:

x log x, which is strictly convex, and use Jensen s Inequality: 2. Information measures: mutual information 2.1 Divergence: main inequality Theorem 2.1 (Information Inequality). D(P Q) 0 ; D(P Q) = 0 iff P = Q Proof. Let ϕ(x) x log x, which is strictly convex, and

More information

Problem Set: TT Quantum Information

Problem Set: TT Quantum Information Problem Set: TT Quantum Information Basics of Information Theory 1. Alice can send four messages A, B, C, and D over a classical channel. She chooses A with probability 1/, B with probability 1/4 and C

More information

Noisy channel communication

Noisy channel communication Information Theory http://www.inf.ed.ac.uk/teaching/courses/it/ Week 6 Communication channels and Information Some notes on the noisy channel setup: Iain Murray, 2012 School of Informatics, University

More information

1. Basic rules of quantum mechanics

1. Basic rules of quantum mechanics 1. Basic rules of quantum mechanics How to describe the states of an ideally controlled system? How to describe changes in an ideally controlled system? How to describe measurements on an ideally controlled

More information

A Holevo-type bound for a Hilbert Schmidt distance measure

A Holevo-type bound for a Hilbert Schmidt distance measure Journal of Quantum Information Science, 205, *,** Published Online **** 204 in SciRes. http://www.scirp.org/journal/**** http://dx.doi.org/0.4236/****.204.***** A Holevo-type bound for a Hilbert Schmidt

More information

Chapter 14: Quantum Information Theory and Photonic Communications

Chapter 14: Quantum Information Theory and Photonic Communications Chapter 14: Quantum Information Theory and Photonic Communications Farhan Rana (MIT) March, 2002 Abstract The fundamental physical limits of optical communications are discussed in the light of the recent

More information

EE 4TM4: Digital Communications II. Channel Capacity

EE 4TM4: Digital Communications II. Channel Capacity EE 4TM4: Digital Communications II 1 Channel Capacity I. CHANNEL CODING THEOREM Definition 1: A rater is said to be achievable if there exists a sequence of(2 nr,n) codes such thatlim n P (n) e (C) = 0.

More information

Useful Concepts from Information Theory

Useful Concepts from Information Theory Chapter 2 Useful Concepts from Information Theory 2.1 Quantifying Information 2.1.1 The entropy It turns out that there is a way to quantify the intuitive notion that some messages contain more information

More information

Chapter 2: Entropy and Mutual Information. University of Illinois at Chicago ECE 534, Natasha Devroye

Chapter 2: Entropy and Mutual Information. University of Illinois at Chicago ECE 534, Natasha Devroye Chapter 2: Entropy and Mutual Information Chapter 2 outline Definitions Entropy Joint entropy, conditional entropy Relative entropy, mutual information Chain rules Jensen s inequality Log-sum inequality

More information

EE5139R: Problem Set 7 Assigned: 30/09/15, Due: 07/10/15

EE5139R: Problem Set 7 Assigned: 30/09/15, Due: 07/10/15 EE5139R: Problem Set 7 Assigned: 30/09/15, Due: 07/10/15 1. Cascade of Binary Symmetric Channels The conditional probability distribution py x for each of the BSCs may be expressed by the transition probability

More information

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2. APPENDIX A Background Mathematics A. Linear Algebra A.. Vector algebra Let x denote the n-dimensional column vector with components 0 x x 2 B C @. A x n Definition 6 (scalar product). The scalar product

More information

Free probability and quantum information

Free probability and quantum information Free probability and quantum information Benoît Collins WPI-AIMR, Tohoku University & University of Ottawa Tokyo, Nov 8, 2013 Overview Overview Plan: 1. Quantum Information theory: the additivity problem

More information

Channel Coding: Zero-error case

Channel Coding: Zero-error case Channel Coding: Zero-error case Information & Communication Sander Bet & Ismani Nieuweboer February 05 Preface We would like to thank Christian Schaffner for guiding us in the right direction with our

More information

Entanglement: concept, measures and open problems

Entanglement: concept, measures and open problems Entanglement: concept, measures and open problems Division of Mathematical Physics Lund University June 2013 Project in Quantum information. Supervisor: Peter Samuelsson Outline 1 Motivation for study

More information

AN INTRODUCTION TO SECRECY CAPACITY. 1. Overview

AN INTRODUCTION TO SECRECY CAPACITY. 1. Overview AN INTRODUCTION TO SECRECY CAPACITY BRIAN DUNN. Overview This paper introduces the reader to several information theoretic aspects of covert communications. In particular, it discusses fundamental limits

More information

Lecture 14 February 28

Lecture 14 February 28 EE/Stats 376A: Information Theory Winter 07 Lecture 4 February 8 Lecturer: David Tse Scribe: Sagnik M, Vivek B 4 Outline Gaussian channel and capacity Information measures for continuous random variables

More information

EE376A: Homeworks #4 Solutions Due on Thursday, February 22, 2018 Please submit on Gradescope. Start every question on a new page.

EE376A: Homeworks #4 Solutions Due on Thursday, February 22, 2018 Please submit on Gradescope. Start every question on a new page. EE376A: Homeworks #4 Solutions Due on Thursday, February 22, 28 Please submit on Gradescope. Start every question on a new page.. Maximum Differential Entropy (a) Show that among all distributions supported

More information

EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018

EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018 Please submit the solutions on Gradescope. EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018 1. Optimal codeword lengths. Although the codeword lengths of an optimal variable length code

More information

Chapter I: Fundamental Information Theory

Chapter I: Fundamental Information Theory ECE-S622/T62 Notes Chapter I: Fundamental Information Theory Ruifeng Zhang Dept. of Electrical & Computer Eng. Drexel University. Information Source Information is the outcome of some physical processes.

More information

Capacity of the Discrete Memoryless Energy Harvesting Channel with Side Information

Capacity of the Discrete Memoryless Energy Harvesting Channel with Side Information 204 IEEE International Symposium on Information Theory Capacity of the Discrete Memoryless Energy Harvesting Channel with Side Information Omur Ozel, Kaya Tutuncuoglu 2, Sennur Ulukus, and Aylin Yener

More information

Density Matrices. Chapter Introduction

Density Matrices. Chapter Introduction Chapter 15 Density Matrices 15.1 Introduction Density matrices are employed in quantum mechanics to give a partial description of a quantum system, one from which certain details have been omitted. For

More information

An Introduction to Quantum Information. By Aditya Jain. Under the Guidance of Dr. Guruprasad Kar PAMU, ISI Kolkata

An Introduction to Quantum Information. By Aditya Jain. Under the Guidance of Dr. Guruprasad Kar PAMU, ISI Kolkata An Introduction to Quantum Information By Aditya Jain Under the Guidance of Dr. Guruprasad Kar PAMU, ISI Kolkata 1. Introduction Quantum information is physical information that is held in the state of

More information

LECTURE 10. Last time: Lecture outline

LECTURE 10. Last time: Lecture outline LECTURE 10 Joint AEP Coding Theorem Last time: Error Exponents Lecture outline Strong Coding Theorem Reading: Gallager, Chapter 5. Review Joint AEP A ( ɛ n) (X) A ( ɛ n) (Y ) vs. A ( ɛ n) (X, Y ) 2 nh(x)

More information

Remarks on the Additivity Conjectures for Quantum Channels

Remarks on the Additivity Conjectures for Quantum Channels Contemporary Mathematics Remarks on the Additivity Conjectures for Quantum Channels Christopher King Abstract. In this article we present the statements of the additivity conjectures for quantum channels,

More information

Quantum Error Correcting Codes and Quantum Cryptography. Peter Shor M.I.T. Cambridge, MA 02139

Quantum Error Correcting Codes and Quantum Cryptography. Peter Shor M.I.T. Cambridge, MA 02139 Quantum Error Correcting Codes and Quantum Cryptography Peter Shor M.I.T. Cambridge, MA 02139 1 We start out with two processes which are fundamentally quantum: superdense coding and teleportation. Superdense

More information

Lecture 2: August 31

Lecture 2: August 31 0-704: Information Processing and Learning Fall 206 Lecturer: Aarti Singh Lecture 2: August 3 Note: These notes are based on scribed notes from Spring5 offering of this course. LaTeX template courtesy

More information

Entanglement-Assisted Capacity of a Quantum Channel and the Reverse Shannon Theorem

Entanglement-Assisted Capacity of a Quantum Channel and the Reverse Shannon Theorem IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 10, OCTOBER 2002 2637 Entanglement-Assisted Capacity of a Quantum Channel the Reverse Shannon Theorem Charles H. Bennett, Peter W. Shor, Member, IEEE,

More information

Module 1. Introduction to Digital Communications and Information Theory. Version 2 ECE IIT, Kharagpur

Module 1. Introduction to Digital Communications and Information Theory. Version 2 ECE IIT, Kharagpur Module ntroduction to Digital Communications and nformation Theory Lesson 3 nformation Theoretic Approach to Digital Communications After reading this lesson, you will learn about Scope of nformation Theory

More information

The Capacity Region for Multi-source Multi-sink Network Coding

The Capacity Region for Multi-source Multi-sink Network Coding The Capacity Region for Multi-source Multi-sink Network Coding Xijin Yan Dept. of Electrical Eng. - Systems University of Southern California Los Angeles, CA, U.S.A. xyan@usc.edu Raymond W. Yeung Dept.

More information

The Adaptive Classical Capacity of a Quantum Channel,

The Adaptive Classical Capacity of a Quantum Channel, The Adaptive Classical Capacity of a Quantum Channel, or Information Capacities of Three Symmetric Pure States in Three Dimensions Peter W. Shor 1 AT&T Labs Research Florham Park, NJ 07932 02139 1 Current

More information

An Introduction To Resource Theories (Example: Nonuniformity Theory)

An Introduction To Resource Theories (Example: Nonuniformity Theory) An Introduction To Resource Theories (Example: Nonuniformity Theory) Marius Krumm University of Heidelberg Seminar: Selected topics in Mathematical Physics: Quantum Information Theory Abstract This document

More information

Lecture 8: Channel and source-channel coding theorems; BEC & linear codes. 1 Intuitive justification for upper bound on channel capacity

Lecture 8: Channel and source-channel coding theorems; BEC & linear codes. 1 Intuitive justification for upper bound on channel capacity 5-859: Information Theory and Applications in TCS CMU: Spring 23 Lecture 8: Channel and source-channel coding theorems; BEC & linear codes February 7, 23 Lecturer: Venkatesan Guruswami Scribe: Dan Stahlke

More information

FRAMES IN QUANTUM AND CLASSICAL INFORMATION THEORY

FRAMES IN QUANTUM AND CLASSICAL INFORMATION THEORY FRAMES IN QUANTUM AND CLASSICAL INFORMATION THEORY Emina Soljanin Mathematical Sciences Research Center, Bell Labs April 16, 23 A FRAME 1 A sequence {x i } of vectors in a Hilbert space with the property

More information

A Graph-based Framework for Transmission of Correlated Sources over Multiple Access Channels

A Graph-based Framework for Transmission of Correlated Sources over Multiple Access Channels A Graph-based Framework for Transmission of Correlated Sources over Multiple Access Channels S. Sandeep Pradhan a, Suhan Choi a and Kannan Ramchandran b, a {pradhanv,suhanc}@eecs.umich.edu, EECS Dept.,

More information

Lecture 11: Polar codes construction

Lecture 11: Polar codes construction 15-859: Information Theory and Applications in TCS CMU: Spring 2013 Lecturer: Venkatesan Guruswami Lecture 11: Polar codes construction February 26, 2013 Scribe: Dan Stahlke 1 Polar codes: recap of last

More information

1 Introduction to information theory

1 Introduction to information theory 1 Introduction to information theory 1.1 Introduction In this chapter we present some of the basic concepts of information theory. The situations we have in mind involve the exchange of information through

More information

MAHALAKSHMI ENGINEERING COLLEGE-TRICHY QUESTION BANK UNIT V PART-A. 1. What is binary symmetric channel (AUC DEC 2006)

MAHALAKSHMI ENGINEERING COLLEGE-TRICHY QUESTION BANK UNIT V PART-A. 1. What is binary symmetric channel (AUC DEC 2006) MAHALAKSHMI ENGINEERING COLLEGE-TRICHY QUESTION BANK SATELLITE COMMUNICATION DEPT./SEM.:ECE/VIII UNIT V PART-A 1. What is binary symmetric channel (AUC DEC 2006) 2. Define information rate? (AUC DEC 2007)

More information

Transmitting and Hiding Quantum Information

Transmitting and Hiding Quantum Information 2018/12/20 @ 4th KIAS WORKSHOP on Quantum Information and Thermodynamics Transmitting and Hiding Quantum Information Seung-Woo Lee Quantum Universe Center Korea Institute for Advanced Study (KIAS) Contents

More information

A Single-letter Upper Bound for the Sum Rate of Multiple Access Channels with Correlated Sources

A Single-letter Upper Bound for the Sum Rate of Multiple Access Channels with Correlated Sources A Single-letter Upper Bound for the Sum Rate of Multiple Access Channels with Correlated Sources Wei Kang Sennur Ulukus Department of Electrical and Computer Engineering University of Maryland, College

More information

Shannon s Noisy-Channel Coding Theorem

Shannon s Noisy-Channel Coding Theorem Shannon s Noisy-Channel Coding Theorem Lucas Slot Sebastian Zur February 13, 2015 Lucas Slot, Sebastian Zur Shannon s Noisy-Channel Coding Theorem February 13, 2015 1 / 29 Outline 1 Definitions and Terminology

More information

Lecture 5: Channel Capacity. Copyright G. Caire (Sample Lectures) 122

Lecture 5: Channel Capacity. Copyright G. Caire (Sample Lectures) 122 Lecture 5: Channel Capacity Copyright G. Caire (Sample Lectures) 122 M Definitions and Problem Setup 2 X n Y n Encoder p(y x) Decoder ˆM Message Channel Estimate Definition 11. Discrete Memoryless Channel

More information

Nullity of Measurement-induced Nonlocality. Yu Guo

Nullity of Measurement-induced Nonlocality. Yu Guo Jul. 18-22, 2011, at Taiyuan. Nullity of Measurement-induced Nonlocality Yu Guo (Joint work with Pro. Jinchuan Hou) 1 1 27 Department of Mathematics Shanxi Datong University Datong, China guoyu3@yahoo.com.cn

More information

Introduction To Information Theory

Introduction To Information Theory Introduction To Information Theory Edward Witten PiTP 2018 We will start with a very short introduction to classical information theory (Shannon theory). Suppose that you receive a message that consists

More information

An introduction to basic information theory. Hampus Wessman

An introduction to basic information theory. Hampus Wessman An introduction to basic information theory Hampus Wessman Abstract We give a short and simple introduction to basic information theory, by stripping away all the non-essentials. Theoretical bounds on

More information

An exponential separation between quantum and classical one-way communication complexity

An exponential separation between quantum and classical one-way communication complexity An exponential separation between quantum and classical one-way communication complexity Ashley Montanaro Centre for Quantum Information and Foundations, Department of Applied Mathematics and Theoretical

More information

Superposition Encoding and Partial Decoding Is Optimal for a Class of Z-interference Channels

Superposition Encoding and Partial Decoding Is Optimal for a Class of Z-interference Channels Superposition Encoding and Partial Decoding Is Optimal for a Class of Z-interference Channels Nan Liu and Andrea Goldsmith Department of Electrical Engineering Stanford University, Stanford CA 94305 Email:

More information

LECTURE 13. Last time: Lecture outline

LECTURE 13. Last time: Lecture outline LECTURE 13 Last time: Strong coding theorem Revisiting channel and codes Bound on probability of error Error exponent Lecture outline Fano s Lemma revisited Fano s inequality for codewords Converse to

More information

Variable Length Codes for Degraded Broadcast Channels

Variable Length Codes for Degraded Broadcast Channels Variable Length Codes for Degraded Broadcast Channels Stéphane Musy School of Computer and Communication Sciences, EPFL CH-1015 Lausanne, Switzerland Email: stephane.musy@ep.ch Abstract This paper investigates

More information

Lecture 22: Final Review

Lecture 22: Final Review Lecture 22: Final Review Nuts and bolts Fundamental questions and limits Tools Practical algorithms Future topics Dr Yao Xie, ECE587, Information Theory, Duke University Basics Dr Yao Xie, ECE587, Information

More information

Lecture 11 September 30, 2015

Lecture 11 September 30, 2015 PHYS 7895: Quantum Information Theory Fall 015 Lecture 11 September 30, 015 Prof. Mark M. Wilde Scribe: Mark M. Wilde This document is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike

More information

(Classical) Information Theory II: Source coding

(Classical) Information Theory II: Source coding (Classical) Information Theory II: Source coding Sibasish Ghosh The Institute of Mathematical Sciences CIT Campus, Taramani, Chennai 600 113, India. p. 1 Abstract The information content of a random variable

More information

Shannon meets Wiener II: On MMSE estimation in successive decoding schemes

Shannon meets Wiener II: On MMSE estimation in successive decoding schemes Shannon meets Wiener II: On MMSE estimation in successive decoding schemes G. David Forney, Jr. MIT Cambridge, MA 0239 USA forneyd@comcast.net Abstract We continue to discuss why MMSE estimation arises

More information

Chapter 9 Fundamental Limits in Information Theory

Chapter 9 Fundamental Limits in Information Theory Chapter 9 Fundamental Limits in Information Theory Information Theory is the fundamental theory behind information manipulation, including data compression and data transmission. 9.1 Introduction o For

More information

18.2 Continuous Alphabet (discrete-time, memoryless) Channel

18.2 Continuous Alphabet (discrete-time, memoryless) Channel 0-704: Information Processing and Learning Spring 0 Lecture 8: Gaussian channel, Parallel channels and Rate-distortion theory Lecturer: Aarti Singh Scribe: Danai Koutra Disclaimer: These notes have not

More information

Lecture 21: Quantum communication complexity

Lecture 21: Quantum communication complexity CPSC 519/619: Quantum Computation John Watrous, University of Calgary Lecture 21: Quantum communication complexity April 6, 2006 In this lecture we will discuss how quantum information can allow for a

More information

The following definition is fundamental.

The following definition is fundamental. 1. Some Basics from Linear Algebra With these notes, I will try and clarify certain topics that I only quickly mention in class. First and foremost, I will assume that you are familiar with many basic

More information

Quantum Information Theory and Cryptography

Quantum Information Theory and Cryptography Quantum Information Theory and Cryptography John Smolin, IBM Research IPAM Information Theory A Mathematical Theory of Communication, C.E. Shannon, 1948 Lies at the intersection of Electrical Engineering,

More information

MAHALAKSHMI ENGINEERING COLLEGE QUESTION BANK. SUBJECT CODE / Name: EC2252 COMMUNICATION THEORY UNIT-V INFORMATION THEORY PART-A

MAHALAKSHMI ENGINEERING COLLEGE QUESTION BANK. SUBJECT CODE / Name: EC2252 COMMUNICATION THEORY UNIT-V INFORMATION THEORY PART-A MAHALAKSHMI ENGINEERING COLLEGE QUESTION BANK DEPARTMENT: ECE SEMESTER: IV SUBJECT CODE / Name: EC2252 COMMUNICATION THEORY UNIT-V INFORMATION THEORY PART-A 1. What is binary symmetric channel (AUC DEC

More information

are Banach algebras. f(x)g(x) max Example 7.4. Similarly, A = L and A = l with the pointwise multiplication

are Banach algebras. f(x)g(x) max Example 7.4. Similarly, A = L and A = l with the pointwise multiplication 7. Banach algebras Definition 7.1. A is called a Banach algebra (with unit) if: (1) A is a Banach space; (2) There is a multiplication A A A that has the following properties: (xy)z = x(yz), (x + y)z =

More information

On Function Computation with Privacy and Secrecy Constraints

On Function Computation with Privacy and Secrecy Constraints 1 On Function Computation with Privacy and Secrecy Constraints Wenwen Tu and Lifeng Lai Abstract In this paper, the problem of function computation with privacy and secrecy constraints is considered. The

More information

(each row defines a probability distribution). Given n-strings x X n, y Y n we can use the absence of memory in the channel to compute

(each row defines a probability distribution). Given n-strings x X n, y Y n we can use the absence of memory in the channel to compute ENEE 739C: Advanced Topics in Signal Processing: Coding Theory Instructor: Alexander Barg Lecture 6 (draft; 9/6/03. Error exponents for Discrete Memoryless Channels http://www.enee.umd.edu/ abarg/enee739c/course.html

More information

Capacity of AWGN channels

Capacity of AWGN channels Chapter 3 Capacity of AWGN channels In this chapter we prove that the capacity of an AWGN channel with bandwidth W and signal-tonoise ratio SNR is W log 2 (1+SNR) bits per second (b/s). The proof that

More information

On the Duality between Multiple-Access Codes and Computation Codes

On the Duality between Multiple-Access Codes and Computation Codes On the Duality between Multiple-Access Codes and Computation Codes Jingge Zhu University of California, Berkeley jingge.zhu@berkeley.edu Sung Hoon Lim KIOST shlim@kiost.ac.kr Michael Gastpar EPFL michael.gastpar@epfl.ch

More information

Majorization-preserving quantum channels

Majorization-preserving quantum channels Majorization-preserving quantum channels arxiv:1209.5233v2 [quant-ph] 15 Dec 2012 Lin Zhang Institute of Mathematics, Hangzhou Dianzi University, Hangzhou 310018, PR China Abstract In this report, we give

More information

Exercise 1. = P(y a 1)P(a 1 )

Exercise 1. = P(y a 1)P(a 1 ) Chapter 7 Channel Capacity Exercise 1 A source produces independent, equally probable symbols from an alphabet {a 1, a 2 } at a rate of one symbol every 3 seconds. These symbols are transmitted over a

More information