Probabilistic Knowledge and Probabilistic Common Knowledge

Paul Krasucki (1), Rohit Parikh (2) and Gilbert Ndjatou (3)

Abstract: In this paper we develop a theory of probabilistic common knowledge and probabilistic knowledge in a group of individuals whose knowledge partitions are not wholly independent.

1 Introduction

Our purpose in this paper is to extend conventional information theory and to address the issue of measuring the amount of knowledge that n individuals have in common. Suppose, for example, that two individuals have partitions which correspond closely; then we would expect that they share a great deal. However, the conventional definition of mutual knowledge may give us the conclusion that there is no fact which is mutually known, or even known to one as being known to another. This is unfortunate because [CM] and [HM] both give us arguments that seem to show that common knowledge (mutual knowledge if two individuals are involved) is both difficult to attain and necessary for certain tasks. If, however, we can show that probabilistic knowledge is both easier to attain and a suitable substitute in many situations, then we have made progress. See [Pa2] for a description of situations where partial knowledge is adequate for communication.

To this end, we shall develop a theory of probabilistic common knowledge which turns out to have surprising and fruitful connections both with traditional information theory and with Markov chains. To be sure, these theories have their own areas of intended application. Nonetheless, it will turn out that our mathematical theory has many points in common with these two theories.

The standard logics of knowledge tend to use Kripke models with S5 accessibility relations, one for each knower. One can easily study instead the partitions corresponding to these accessibility relations, and we shall do this. We also assume that the space W of possible worlds has a probability measure µ given with it.

(1) Department of Computer Science, Rutgers-Camden.
(2) Department of Computer Science, CUNY Graduate Center, 33 West 42nd Street, New York, NY. RIPBC@CUNYVM.CUNY.EDU.
(3) Department of Computer Science, College of Staten Island, CUNY, and CUNY Graduate Center.

In Figure I below, Ann has partition A = {A_1, A_2} and Bob has partition B = {B_1, B_2}, so that each of the sets A_i, B_j has probability 0.5 and the intersections A_i ∩ B_j have probability 0.45 when i = j and 0.05 otherwise.

[Figure I. The vertical line divides A_1 from A_2; the slanted line divides B_1 from B_2; the four regions A_i ∩ B_j are labelled with their probabilities, 0.45 when i = j and 0.05 otherwise.]

Since the meet of the partitions is trivial, there is no common knowledge in the usual sense of [Au], [HM]. In fact there is no nontrivial proposition p such that Ann knows that Bob knows p. It is clear, however, that Ann and Bob have nearly the same information, and if the partitions are themselves common knowledge, then Ann and Bob will be able to guess, with high probability, what the other knows. We would like then to say that Ann and Bob have probabilistic common knowledge, but how much? One purpose of this paper is to answer this question and to prove properties of our definition that show why the answer is plausible.

A closely related question is that of measuring indirect probabilistic knowledge. For example, we would expect that what Ann knows about Bob's knowledge is less than or equal to what Bob himself knows, and what Ann knows of Bob's knowledge of Carol is in turn less than or equal to the amount of knowledge that Bob has about Carol's knowledge. We would expect in the limit that what Ann knows about what Bob knows about what Ann knows ... about what Bob knows will approach whatever ordinary common knowledge they have.

It turns out that to tackle these questions successfully, we need a third notion. This is the notion of the amount of information acquired when one's probabilities change as a result of new information (which does not invalidate old information). Suppose for example that I am told that a certain fruit is a peach. I may then assign a probability of 0.45 to the proposition that it is sweet. If I then learn that it just came off a tree, I will expect that it was probably picked for shipping and the probability may drop to 0.2; but if I learn instead that it fell off the tree, then it will rise to 0.9. In each case I am getting information, consistent with previous information and causing me to revise my probabilities, but how much information am I getting?
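To make the Figure I situation concrete, here is a small Python sketch (the variable and function names are our own, not notation from the paper) that stores the joint distribution of Ann's and Bob's cells and shows how Ann revises her probabilities for Bob's cell once she learns her own.

```python
# Joint distribution of (Ann's cell, Bob's cell) from Figure I.
joint = {("A1", "B1"): 0.45, ("A1", "B2"): 0.05,
         ("A2", "B1"): 0.05, ("A2", "B2"): 0.45}

def bob_given_ann(a):
    """Bob's cell probabilities as Ann revises them after learning her own cell a."""
    p_a = sum(p for (ai, _), p in joint.items() if ai == a)
    return {b: p / p_a for (ai, b), p in joint.items() if ai == a}

print(bob_given_ann("A1"))  # {'B1': 0.9, 'B2': 0.1}
```

Although no cell of Ann's partition is contained in a cell of Bob's, so that nothing is known in the strict sense, each agent can guess the other's cell correctly with probability 0.9; this is the situation the rest of the paper quantifies.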

2 Preliminaries

We start by giving some definitions, some old, some apparently new. If a space has 2^n points, all equally likely, then the amount of information gained by knowing the identity of a specific point x is n bits. If one only knows a set X in which x falls, then the information gained is less, in fact equal to

    I(X) = −log(µ(X)),

where µ(X) is the probability of X. (We will use the letter µ for both absolute and relative probabilities, to save the letter p for other uses. All logs will be to base 2, and since x log(x) → 0 as x → 0 we will take x log(x) to be 0 when x is 0.) If P = {P_1, ..., P_k} is a partition of the whole space W, then the expected information when one discovers the identity of the P_i which contains x is

    H(P) = Σ_i µ(P_i) I(P_i) = −Σ_i µ(P_i) log(µ(P_i)).

These definitions so far are standard in the literature [Sh], [Ab], [Dr]. We now introduce a notion which is apparently new. Suppose I have a partition P = {P_1, ..., P_k} whose a priori probabilities are y_1, ..., y_k, but some information that I receive causes me to change them to u_1, ..., u_k. How much information have I received?

Definition 1: IG(u, y) = Σ_i u_i (log u_i − log y_i) = Σ_i u_i log(u_i / y_i).

Here IG stands for information gain. Clearly this definition needs some justification. We will first provide an intuitive explanation, and then prove some properties of this notion IG which will make it more plausible that it is the right one.

(a) Suppose that the space had 2^n points, and the distribution of probabilities that we had was the flat distribution. Then the set P_i has 2^n y_i points. (There is a tacit assumption here that the y_i are of the form k/2^n. But note that numbers of this form are dense in the unit interval, and if we assume that the function IG is continuous, then it is sufficient to consider numbers of this form.) After we receive our information, the points are no longer equally likely, and each point in P_i has probability u_i / |P_i| = u_i / (y_i 2^n). Thus the expected information of the partition into the 2^n singleton sets is

    −Σ_i (y_i 2^n) (u_i / (y_i 2^n)) log(u_i / (y_i 2^n)),

which comes out to α = n − Σ_i u_i (log u_i − log y_i). Since the flat distribution had expected information n, we have gained information equal to

    n − α = n − (n − Σ_i u_i (log u_i − log y_i)) = Σ_i u_i (log u_i − log y_i) = Σ_i u_i log(u_i / y_i).
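Before turning to the second check, a direct transcription of Definition 1 into Python may be useful. The function below is a minimal sketch (the names are ours), using the paper's conventions that logs are base 2 and that a term with u_i = 0 contributes 0; it is reused in the sketches that follow.

```python
from math import log2

def IG(u, y):
    """Information gain (Definition 1) when the prior probabilities y are revised to u.
    Assumes y_i > 0 whenever u_i > 0; terms with u_i = 0 contribute 0."""
    return sum(ui * (log2(ui) - log2(yi)) for ui, yi in zip(u, y) if ui > 0)

# Revising a flat prior on four cells after learning that the last cell is impossible:
print(round(IG((1/3, 1/3, 1/3, 0), (0.25, 0.25, 0.25, 0.25)), 3))  # 0.415, i.e. log2(4/3)
```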

(b) In information theory, we have a notion of the information that two partitions P and Q share, also called their mutual information, and usually denoted by I(P; Q):

    I(P; Q) = Σ_{i,j} µ(P_i ∩ Q_j) log( µ(P_i ∩ Q_j) / (µ(P_i) µ(Q_j)) ).

We will recalculate this quantity using the function IG. If Ann has partition P, then with probability µ(P_i) she knows that P_i is true. In that case, she will revise her probabilities of Bob's partition from µ(Q) to µ(Q | P_i), and her information gain about Bob's partition is IG(µ(Q | P_i), µ(Q)). Summing over all the P_i we get

    Σ_i µ(P_i) IG(µ(Q | P_i), µ(Q)) = Σ_i µ(P_i) ( Σ_j µ(Q_j | P_i) log( µ(Q_j | P_i) / µ(Q_j) ) ),

and an easy calculation shows that this is the same as

    I(P; Q) = Σ_{i,j} µ(P_i ∩ Q_j) log( µ(P_i ∩ Q_j) / (µ(P_i) µ(Q_j)) ).

Since the calculation through IG gives the same result as the usual formula, this gives additional support to the claim that our formula for the information gain is the right one.
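As a check on point (b), the following sketch computes both sides for the Ann/Bob partitions of Figure I; the IG helper is the one sketched after Definition 1, and all names are ours.

```python
from math import log2

def IG(u, y):
    return sum(ui * (log2(ui) - log2(yi)) for ui, yi in zip(u, y) if ui > 0)

# Figure I: joint[i][j] = mu(P_i intersect Q_j) for Ann's cells P_i and Bob's cells Q_j.
joint = [[0.45, 0.05],
         [0.05, 0.45]]
p = [sum(row) for row in joint]             # mu(P_i) = 0.5, 0.5
q = [sum(col) for col in zip(*joint)]       # mu(Q_j) = 0.5, 0.5

# Mutual information by the usual formula ...
I_pq = sum(joint[i][j] * log2(joint[i][j] / (p[i] * q[j]))
           for i in range(2) for j in range(2))

# ... and recomputed as the expected information gain about Q when a cell of P is learned.
via_IG = sum(p[i] * IG([joint[i][j] / p[i] for j in range(2)], q) for i in range(2))

print(round(I_pq, 3), round(via_IG, 3))     # 0.531 0.531
```

The two numbers agree, and 0.53 bits is the figure quoted for the Ann/Bob example in Section 5 below.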

3 Properties of information gain

Theorem 1:
(a) IG(u, v) ≥ 0, and IG(u, v) = 0 iff u = v.
(b1) If p = µ(P), i.e. p = (µ(P_1), ..., µ(P_k)), and if there is a set X such that u_i = µ(P_i | X) for all i, then IG(u, p) ≤ −log(µ(X)). Thus the information received, by way of a change of probabilities, is less than or equal to the information I(X) contained in X.
(b2) Equality obtains in (b1) above iff for all i, either µ(P_i ∩ X) = µ(P_i), or else µ(P_i ∩ X) = 0. Thus if all nonempty sets involved have non-zero measure, every P_i is either a subset of X or disjoint from it.

Proof: (a) It is straightforward to show using elementary calculus that log x < (x − 1) log e except when x = 1, when the two are equal. (Here e is of course the number whose natural log is 1; note that log e = log_2 e = 1/ln 2. The line y = (x − 1) log e is tangent to the curve y = log x at (1, 0), and lies above it.) Replacing x by 1/x we get log x > (1 − 1/x) log e, except again at x = 1. This yields

    IG(u, v) = Σ_{i: u_i > 0} u_i log(u_i / v_i) ≥ Σ_{i: u_i > 0} u_i (1 − v_i / u_i) log e = (1 − Σ_{i: u_i > 0} v_i) log e ≥ 0,

with equality in the first inequality iff u_i = v_i whenever u_i > 0, and in the second iff v_i = 0 whenever u_i = 0. Together these conditions say exactly that u = v.

(b1) Let u_i = µ(P_i | X) and p = (µ(P_1), ..., µ(P_k)). Then

    IG(u, p) = Σ_i µ(P_i | X) log( µ(P_i | X) / µ(P_i) )
             = Σ_i µ(P_i | X) log( µ(P_i ∩ X) / (µ(P_i) µ(X)) )
             = Σ_i µ(P_i | X) log( µ(P_i ∩ X) / µ(P_i) ) − Σ_i µ(P_i | X) log(µ(X))
             = α + I(X),

where α = Σ_i µ(P_i | X) log( µ(P_i ∩ X) / µ(P_i) ) ≤ 0, since µ(P_i ∩ X) / µ(P_i) ≤ 1 for all i and Σ_i µ(P_i | X) = 1. Hence IG(u, p) ≤ I(X) = −log(µ(X)).

(b2) α = 0 only if, for all i, µ(P_i ∩ X) = 0 or µ(P_i ∩ X) = µ(P_i), i.e. either P_i ∩ X = ∅ or P_i ⊆ X up to measure zero (X is then a union of the P_i's).
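A quick numerical illustration of (b1) and (b2), under our own choice of numbers: conditioning a four-cell partition on a set X never gains more than −log µ(X), and attains that bound exactly when X is a union of cells.

```python
from math import log2

def IG(u, y):
    return sum(ui * (log2(ui) - log2(yi)) for ui, yi in zip(u, y) if ui > 0)

p = (0.25, 0.25, 0.25, 0.25)          # prior cell probabilities mu(P_i)

# X a union of cells (P_1 and P_2): mu(X) = 0.5, and the gain hits the bound -log mu(X) = 1.
print(IG((0.5, 0.5, 0.0, 0.0), p))                       # 1.0

# X overlapping the cells unevenly: mu(P_i intersect X) = 0.20, 0.15, 0.10, 0.05, so mu(X) = 0.5.
u = (0.20 / 0.5, 0.15 / 0.5, 0.10 / 0.5, 0.05 / 0.5)     # mu(P_i | X)
print(round(IG(u, p), 3), -log2(0.5))                    # 0.154 1.0  (strictly below the bound)
```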

If we learn that one of the sets we had initially considered possible (its probability was greater than zero) can be excluded, then our information gain is least if the probability of the excluded piece is distributed over all the other elements of the partition, proportionately to their initial probabilities. The gain is greatest when the probability of the excluded piece is shifted to a single element of the partition, and this element was initially one of the least likely elements.

Theorem 2: Let v = (v_1, ..., v_{k−1}, v_k) and u = (u_1, ..., u_{k−1}, u_k), where u_k = 0, u_i = v_i + a_i v_k for i = 1, ..., k−1, a_i ≥ 0, Σ_{i=1}^{k−1} a_i = 1, and v_k > 0. Then:
(a) IG(u, v) is minimum when a_i / v_i = c is the same for all i = 1, ..., k−1, with c = 1/(1 − v_k); moreover, this minimum value is just −log(1 − v_k).
(b) IG(u, v) is maximum when a_i = 1 for some i such that v_i = min_{j=1,...,k−1}(v_j) and the other a_j are 0.

Proof: (a) Let a = (a_1, ..., a_{k−2}, a_{k−1}). Since Σ_{i=1}^{k−1} a_i = 1 we have a_{k−1} = 1 − Σ_{i=1}^{k−2} a_i. So we need only look at f : [0, 1]^{k−2} → R, defined by

    f(a) = IG(u, v) = Σ_{i=1}^{k−2} (v_i + a_i v_k) log((v_i + a_i v_k) / v_i)
                      + (v_{k−1} + v_k (1 − Σ_{j=1}^{k−2} a_j)) log((v_{k−1} + v_k (1 − Σ_{j=1}^{k−2} a_j)) / v_{k−1}).

To find the extrema of f in [0, 1]^{k−2}, consider the partial derivatives

    ∂f/∂a_i = v_k ( log((v_i + a_i v_k) / v_i) − log((v_{k−1} + a_{k−1} v_k) / v_{k−1}) ),

recalling that a_{k−1} = 1 − Σ_{i=1}^{k−2} a_i. Thus ∂f/∂a_i = 0 iff (v_i + a_i v_k)/v_i = (v_{k−1} + a_{k−1} v_k)/v_{k−1}, i.e. iff a_i / v_i = a_{k−1} / v_{k−1}; so at a critical point a_i = c v_i where c is a constant and i ranges over 1, ..., k−1. If we add these equations and use the fact that Σ_{i=1}^{k−1} a_i = 1 and the fact that Σ_{i=1}^{k−1} v_i = 1 − v_k, we get c = 1/(1 − v_k). Now each ∂f/∂a_i is an increasing function of a_i (the second derivatives are positive), so this critical point is a minimum: f has its minimum when a_i = v_i/(1 − v_k) for all i. The fact that this minimum value is −log(1 − v_k) is easily calculated by substitution. Note that this quantity is exactly equal to I(X) where X is the complement of the set P_k whose probability was v_k. Thus we have an exact correspondence with parts (b1) and (b2) of the previous theorem.

(b) To get the maximum, note that since the first derivatives ∂f/∂a_i are always increasing and the second derivatives are all positive, the maxima can only occur at the vertices of the domain, that is, at points where some a_j = 1 and the remaining a's are 0. (If a maximum occurred elsewhere, we could increase the value by moving in some direction.) Now the value of f at the point with a_i = δ(i, j) is IG(u, v) = g(v_j), where g(x) = (x + v_k) log((x + v_k) / x). But g(x) is a decreasing function of x, so IG(u, v) is maximum when a_j = 1 for some j such that v_j is minimal.

Example 1: Suppose for example that a partition {P_1, P_2, P_3, P_4} is such that all the P_i have probabilities equal to 0.25. If we now receive the information that P_4 is impossible, then we will have gained information approximately equal to

    IG((1/3, 1/3, 1/3, 0), (0.25, 0.25, 0.25, 0.25)) = 3 · (1/3) · log((1/3)/(1/4)) = log(4/3) ≈ 0.42.

The gain is the same if we discover instead that it is P_3 which is impossible. If, however, we only discover that the total probability of the set P_3 ∪ P_4 has decreased to 0.33, then our information gain is only

    IG((0.33, 0.33, 0.17, 0.17), (0.25, 0.25, 0.25, 0.25)) ≈ 0.08,

which is much less. And this makes sense, since knowing that the set P_3 ∪ P_4 has gone down in weight tells us less than knowing that half of it is no longer to be considered, and moreover which half. If we discover that P_4 is impossible and all the cases that we had thought to be in P_4 are in fact in P_1, then the information gain is

    IG((0.50, 0.25, 0.25, 0), (0.25, 0.25, 0.25, 0.25)) = (1/2) log 2 = 0.5,

which is more than our information gain in the two previous cases.
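The three gains in Example 1 can be reproduced with the IG helper sketched earlier (a verification sketch with our own variable names). The first case is Theorem 2's minimum, −log(1 − 0.25), and the third is its maximum, since all remaining cells are equally (and hence minimally) likely.

```python
from math import log2

def IG(u, y):
    return sum(ui * (log2(ui) - log2(yi)) for ui, yi in zip(u, y) if ui > 0)

prior = (0.25, 0.25, 0.25, 0.25)

cases = {
    "P4 ruled out, mass spread proportionally":  (1/3, 1/3, 1/3, 0),
    "P3 and P4 together shrink to 0.33":         (0.335, 0.335, 0.165, 0.165),  # the .33/.17 of Example 1, unrounded
    "P4 ruled out, all of its mass moved to P1": (0.50, 0.25, 0.25, 0),
}
for name, u in cases.items():
    print(f"{name}: {IG(u, prior):.3f}")
# P4 ruled out, mass spread proportionally: 0.415
# P3 and P4 together shrink to 0.33: 0.085
# P4 ruled out, all of its mass moved to P1: 0.500
```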

Example 2: As the following example shows, IG does not satisfy the triangle inequality: if we revise our probabilities from y to u and then again to v, our total gain can be less than if we had revised them straight from y to v. This may perhaps explain why we do not notice gradual changes, but are struck by the cumulative effect of all of them. Take v = (0.1, 0.9), u = (0.25, 0.75), y = (0.5, 0.5). Then IG(v, u) + IG(u, y) ≈ 0.10 + 0.19 = 0.29, while IG(v, y) ≈ 0.53. Also IG(y, v) ≈ 0.74, so that IG is not symmetric.

Another way to see that this failure of the triangle inequality is reasonable is to notice that we could have gained information by first relativising to a set X, and then to another set Y, gaining information −log(µ(X)) and −log(µ(Y)) respectively. However, to get the cumulative information gain, we might need to relativise to X ∩ Y, whose probability might be much less than µ(X) µ(Y).

We have defined the mutual knowledge I(P; Q) of two partitions P, Q. If we denote their join by P + Q, then the quantity usually denoted in the literature by H(P, Q) is merely H(P + Q). The connection between mutual information and entropy is well known [Ab]:

    H(P + Q) = H(P) + H(Q) − I(P; Q).

Moreover, the equivocation H(P | Q) of P with respect to Q is defined as H(P | Q) = H(P) − I(P; Q). If i and j are agents with partitions P_i and P_j respectively, then inf(ij) will be just I(P_i; P_j). The equivocations are non-negative, and I is symmetric, and so we have:

    I(P; Q) ≤ min(H(P), H(Q)).

Thus what Ann knows about Bob's knowledge is never more than what Bob himself knows, nor more than what Ann herself knows.

We want now to generalise these notions to more than two people, for which we will need a notion from the theory of Markov chains, namely stochastic matrices. We start by making a connection between boolean matrices and the usual notion of knowledge.

4 Common knowledge and Boolean matrices

We start by reviewing some notions from ordinary knowledge theory [Au], [HM], [PK].

Definition 2: Suppose that {1, ..., k} are individuals and i has knowledge partition P_i. If w ∈ W then i knows E at w iff P_i(w) ⊆ E, where P_i(w) is the element of the partition P_i containing w.
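Definition 2 is easy to state operationally; the sketch below (a toy model of our own, not from the paper) represents partitions as lists of cells over a finite set of worlds and computes the knowledge operator.

```python
# Worlds are 1..4; a partition is a list of cells (frozensets of worlds).
ann = [frozenset({1, 2}), frozenset({3, 4})]
bob = [frozenset({1, 3}), frozenset({2, 4})]

def K(partition, event):
    """K_i(E): the set of worlds at which the agent with this partition knows E."""
    return {w for c in partition for w in c if c <= event}

E = {1, 2, 3}
print(K(ann, E))          # {1, 2}: Ann knows E exactly where her cell {1, 2} lies inside E
print(K(bob, K(ann, E)))  # set(): at no world does Bob know that Ann knows E
```

Iterating K operators in this way is exactly what Definition 3 below quantifies over, and is what the boolean matrices of Definition 4 encode.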

We write K_i(E) = {w : i knows E at w}; note that K_i(E) is always a subset of E. Write w ∼_i w′ if w and w′ are in the same element of the partition P_i (i.e. iff P_i(w) = P_i(w′)). Then i knows E at w iff for all w′, w ∼_i w′ implies w′ ∈ E. Also, it follows that i knows that j knows E at w iff w ∈ K_i(K_j(E)), iff the union of those elements P_j^l of P_j with P_j^l ∩ P_i(w) ≠ ∅ is included in E, i.e. iff {w′ : there is a v such that w ∼_i v ∼_j w′} ⊆ E.

Definition 3: An event E is common knowledge among a group of individuals i_1, ..., i_m at w iff for all j_1, ..., j_k ∈ {i_1, ..., i_m} and all worlds w_1, ..., w_{k−1}, w′: if w ∼_{j_1} w_1 ∼_{j_2} ... ∼_{j_{k−1}} w_{k−1} ∼_{j_k} w′, then w′ ∈ E; equivalently, iff w ∈ X(E) for every finite composition X of the operators K_{i_1}, ..., K_{i_m}.

We now analyse knowledge and common knowledge using boolean transition matrices. (The subscripts to the matrices will denote the knowers, and the row and column will be presented explicitly as arguments; thus B_ij(k, l) is the entry in the kth row and lth column of the matrix B_ij.)

Definition 4: The boolean transition matrix B_ij of the pair ij is defined by letting B_ij(k, l) = 1 if P_i^k ∩ P_j^l ≠ ∅, and 0 otherwise.

We can extend this definition to a string of individuals x = i_1 ... i_k:

Definition 5: The boolean transition matrix B_x for a string x = i_1 ... i_k is B_x = B_{i_1 i_2} ⊗ B_{i_2 i_3} ⊗ ... ⊗ B_{i_{k−1} i_k}, where ⊗ is normalised matrix multiplication: if the ordinary product (B · B′)(k, l) > 0 then (B ⊗ B′)(k, l) is set to 1, otherwise it is 0. We can also define ⊗ directly by (B ⊗ B′)(k, l) = ∨_{m=1}^{n} (B(k, m) ∧ B′(m, l)).

We say that there is no non-trivial common knowledge iff the only event that is common knowledge at any w is the whole space W.

Fact 1: There is no non-trivial common knowledge iff for every string x including all individuals, lim_{n→∞} B_{x^n} = 1, where 1 is the matrix filled with 1's only.

We now consider the case of stochastic matrices.

5 Information via a string of agents

When we consider boolean transition matrices, we may lose some information. If we know the probabilities of all the elements of the σ-field generated by the join of the partitions P_i, the boolean transition matrix B_ij is created by putting a 1 in position (k, l) iff µ(P_j^l | P_i^k) > 0, and 0 otherwise. We keep more of the information by putting µ(P_j^l | P_i^k) in position (k, l). We denote this matrix by M_ij and we call it the transition matrix from i to j.
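The passage from B_ij to M_ij can be sketched as follows (Python, with our own helper names), starting from the joint probabilities of the cells; the boolean matrix records only which transitions are possible, while M_ij records how likely they are.

```python
# Joint probabilities mu(P_i^k intersect P_j^l) for the Figure I partitions
# (rows: i's cells, columns: j's cells).
joint = [[0.45, 0.05],
         [0.05, 0.45]]

def transition_matrix(joint):
    """M_ij(k, l) = mu(P_j^l | P_i^k)."""
    return [[p / sum(row) for p in row] for row in joint]

def boolean_matrix(joint):
    """B_ij(k, l) = 1 iff the cells P_i^k and P_j^l overlap."""
    return [[1 if p > 0 else 0 for p in row] for row in joint]

print(boolean_matrix(joint))     # [[1, 1], [1, 1]]
print(transition_matrix(joint))  # [[0.9, 0.1], [0.1, 0.9]]
```

In this example B_ij is the all-ones matrix, which is exactly the no-common-knowledge condition of Fact 1, while M_ij retains the 0.9 / 0.1 asymmetry that the rest of this section exploits.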

Definition 6: For every i, j, the ij-transition matrix M_ij is defined by M_ij(a, b) = µ(P_j^b | P_i^a). For all i, M_ii is the unit matrix of dimension equal to the size of the partition P_i.

Definition 7: If x is a string of elements of {1, ..., k} (x ∈ {1, ..., k}*, x = x_1 ... x_n), then M_x = M_{x_1 x_2} · ... · M_{x_{n−1} x_n} is the transition matrix for x.

We now define inf(ixj), where x is a sequence of agents. inf(ixj) will be the information that i has about j via x. If e.g. i = 3, x = 1, j = 2, we should interpret inf(ixj) as the amount of information 3 has about 1's knowledge of 2.

Example 3: In our example in the introduction, if i were Ann and j were Bob, then we would get

    M_ij = [ 0.9  0.1 ]
           [ 0.1  0.9 ].

The matrix M_ji equals the matrix M_ij, and the matrix M_iji is

    M_iji = M_ij · M_ji = [ 0.82  0.18 ]
                          [ 0.18  0.82 ].

Thus it turns out that each of Ann and Bob has 0.53 bits of knowledge about the other, and Ann has 0.32 bits of knowledge about Bob's knowledge of her.

Definition 8: Let m_l = (m_{l1}, ..., m_{lk}) be the lth row vector of the transition matrix M_ixj; thus m_{lt} is the probability that a point in P_i^l will end up in P_j^t after a random move within P_i^l, followed by a sequence of random moves within the elements of those P_{x_r} which form x. Then:

    inf(ixj) = Σ_{l=1}^{k} µ(P_i^l) IG(m_l, µ(P_j)),

where IG(m_l, µ(P_j)) is the information gain of the distribution m_l over the distribution µ(P_j).

The intuitive idea is that the a priori probabilities of j's partition are µ(P_j). However, if w is in P_i^l, the lth set in i's partition, then these probabilities will be revised according to the lth row of the matrix M_ixj, and the information gain will be IG(m_l, µ(P_j)). The expected information gain for i about j via x is then obtained by multiplying by the µ(P_i^l)'s and summing over all l.
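Definition 8 can be checked against Example 3 directly; the sketch below (our own names, with the IG helper from Section 2) multiplies out M_iji and evaluates inf(ij) and inf(iji) for the Figure I partitions.

```python
from math import log2

def IG(u, y):
    return sum(ui * (log2(ui) - log2(yi)) for ui, yi in zip(u, y) if ui > 0)

def matmul(A, B):
    return [[sum(A[i][m] * B[m][j] for m in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def inf(M, p_i, p_j):
    """inf(ixj) = sum_l mu(P_i^l) * IG(l-th row of M_ixj, mu(P_j))  (Definition 8)."""
    return sum(pl * IG(row, p_j) for pl, row in zip(p_i, M))

p = [0.5, 0.5]                      # mu of Ann's cells; Bob's are the same
M_ij = [[0.9, 0.1], [0.1, 0.9]]     # equals M_ji in this symmetric example
M_iji = matmul(M_ij, M_ij)          # [[0.82, 0.18], [0.18, 0.82]]

print(round(inf(M_ij, p, p), 2))    # 0.53  (what Ann knows about Bob)
print(round(inf(M_iji, p, p), 2))   # 0.32  (what Ann knows about Bob's knowledge of her)
```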

Example 4: Consider M_iji. For convenience we will denote the elements P_i^m by A_m and the elements P_j^m by B_m (so that the A's are elements of i's partition, and the B's are elements of j's partition). Therefore M_iji = M_ij · M_ji, where M_ij is the matrix whose (s, t) entry is µ(B_t | A_s) and M_ji is the matrix whose (s, t) entry is µ(A_t | B_s). M_iji is then the matrix whose entry M_iji(m, l) is the probability that a point in A_m will end up in A_l after a random move within A_m followed by a random move within some B_s; that is,

    M_iji(m, l) = Σ_s µ(B_s | A_m) µ(A_l | B_s),   for l, m = 1, ..., k.

Note that for x = λ, where λ is the empty string, inf(ij) = I(P_i; P_j), as in the standard definition:

    inf(ij) = Σ_{l=1}^{k} µ(P_i^l) IG(µ(P_j | P_i^l), µ(P_j))
            = Σ_{l=1}^{k} µ(P_i^l) Σ_{t=1}^{k} µ(P_j^t | P_i^l) log( µ(P_j^t | P_i^l) / µ(P_j^t) )
            = Σ_{l,t=1}^{k} µ(P_j^t ∩ P_i^l) log( µ(P_j^t ∩ P_i^l) / (µ(P_j^t) µ(P_i^l)) ).

6 Properties of transition matrices

The results in this section are either from the theory of Markov chains, or easily derived from it.

Definition 9: A matrix M is stochastic if all elements of M are reals in [0, 1] and the sum of every row is 1.

Fact 2: For every x, the matrix M_x is stochastic.

Definition 10: A matrix M is regular if there is an m such that M^m(k, l) > 0 for all (k, l).

The following fact establishes a connection between regular stochastic matrices and common knowledge:

Fact 3: The matrix M_ixi is regular iff there is no common knowledge between i and the individuals from x.

Fact 4: For every regular stochastic matrix M, there is a matrix M^∞ such that lim_{n→∞} M^n = M^∞; M^∞ is stochastic, and all the rows in M^∞ are the same. Moreover the rate of convergence is exponential: for a given column r, let d_n(r) be the difference between the maximum and the minimum entry of M^n in that column. Then there is an ε < 1 such that for all columns r and all sufficiently large n, d_n(r) ≤ ε^n.
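For the Figure I example, Facts 4 and 5 can be watched numerically (a sketch with our own names); it also anticipates Theorem 3 below, since the information carried by longer and longer chains of the form "what Ann knows about what Bob knows about what Ann knows ..." decays to 0.

```python
from math import log2

def IG(u, y):
    return sum(ui * (log2(ui) - log2(yi)) for ui, yi in zip(u, y) if ui > 0)

def matmul(A, B):
    return [[sum(A[i][m] * B[m][j] for m in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

p = [0.5, 0.5]
M_iji = [[0.82, 0.18], [0.18, 0.82]]   # regular: no common knowledge between Ann and Bob

M = M_iji
for n in range(1, 6):
    info = sum(pl * IG(row, p) for pl, row in zip(p, M))
    print(n, [round(v, 3) for v in M[0]], round(info, 3))
    M = matmul(M, M_iji)
# 1 [0.82, 0.18] 0.32
# 2 [0.705, 0.295] 0.125
# ...
# 5 [0.554, 0.446] 0.008   -- rows approach mu(P_i) = (0.5, 0.5) and the information gain approaches 0
```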

By combining the last two facts we get the following corollary:

Fact 5: If there is no common knowledge between i and the individuals in x, then lim_{n→∞} (M_ixi)^n = M^∞, where M^∞ is stochastic and all rows in M^∞ are equal to the vector u_i of probabilities of the sets in the partition P_i. A matrix with all rows equal represents the situation that all information is lost and all that is known is the a priori probabilities.

Fact 6: If L, S are stochastic matrices and all the rows of L are equal, then S · L = L, and L · S = L′, where all rows in L′ are equal (though they may be different from those of L).

Fact 7: For any stochastic matrix S and regular matrix M_ixi: S · lim_{n→∞} (M_ixi)^n = M^∞, where M^∞ = lim_{n→∞} (M_ixi)^n.

Definition 11: For a given partition P_i and string x = x_1 x_2 ... x_k we can define a relation →_x between the partitions P_i and P_j: P_i^m →_x P_j^n iff there are w ∈ P_i^m, w′ ∈ P_j^n and v_1, ..., v_{k−1} such that w ∼_{x_1} v_1 ∼_{x_2} v_2 ... v_{k−1} ∼_{x_k} w′.

Definition 12: ≈_x is the transitive closure of →_x; it is an equivalence relation.

Fact 8: Assume that x contains all the individuals j. Then the relation ≈_x does not depend on the particular x, and we may drop the subscript x: P_i^m ≈ P_j^n iff P_i^m and P_j^n are subsets of the same element of P_∧, where P_∧ is the meet of the partitions of all the individuals.

Observation: We can permute the elements of the partition P_i so that elements in the same equivalence class of ≈ have consecutive numbers; then M_ixi is block diagonal,

    M_ixi = diag(M_1, ..., M_r),

where M_l (for l ≤ r) is the matrix corresponding to the transitions within one equivalence class of ≈. All submatrices M_l are square and regular.
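When there is common knowledge, the block structure in the Observation can be seen in a small example of our own: take four equally likely worlds, let i's partition be {1}, {2}, {3,4} and j's be {1,2}, {3}, {4}, so that {1,2} and {3,4} form the meet P_∧ and are common knowledge. The sketch below builds M_iji and shows the two blocks.

```python
def matmul(A, B):
    return [[sum(A[i][m] * B[m][j] for m in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transition(cells_i, cells_j, mu):
    """M(k, l) = mu(cell_j_l | cell_i_k) for partitions given as lists of sets of worlds."""
    return [[sum(mu[w] for w in ci & cj) / sum(mu[w] for w in ci) for cj in cells_j]
            for ci in cells_i]

mu = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}
P_i = [{1}, {2}, {3, 4}]
P_j = [{1, 2}, {3}, {4}]

M_iji = matmul(transition(P_i, P_j, mu), transition(P_j, P_i, mu))
for row in M_iji:
    print([round(v, 2) for v in row])
# [0.5, 0.5, 0.0]
# [0.5, 0.5, 0.0]   -- one block for the cells of i inside {1, 2} ...
# [0.0, 0.0, 1.0]   -- ... and a second block for the cell {3, 4}; M_iji is not regular
```

Here i's cells are already numbered so that the two classes of ≈ are consecutive, which is exactly the block-diagonal form described in the Observation.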

Note that if there is no common knowledge then ≈ has a single equivalence class. Since we can always renumber the elements of the partitions so that the transition matrix is in the form described above, we will assume from now on that the transition matrix is always given in such a form.

Fact 9: If x contains all the individuals j, then lim_{n→∞} (M_ixi)^n = M^∞, where M^∞ is stochastic, the submatrices M^∞_l of M^∞ are regular (in fact positive), and all the rows within every submatrix M^∞_l are the same.

7 Properties of inf(ixj)

Theorem 3: If there is no common knowledge and x includes all the individuals, then

    lim_{n→∞} inf(i(jxj)^n) = 0.

Proof: The matrix M^∞ = lim_{n→∞} (M_jxj)^n has all rows positive and equal. Let m be a row vector of M^∞. Then lim_{n→∞} inf(i(jxj)^n) = IG(m, µ(P_j)). Since the limiting vector m is equal to the distribution µ(P_j), we get lim_{n→∞} inf(i(jxj)^n) = IG(µ(P_j), µ(P_j)) = 0.

The last theorem can easily be generalised to the following:

Fact 10: If there is no common knowledge among the individuals in x, and i, j occur in x, then inf(ix^n j) goes to zero as n → ∞.

8 Probabilistic common knowledge

Common knowledge is very rare. But even if there is no common knowledge in the system, we often have probabilistic common knowledge.

Definition 13: Individuals {1, ..., n} have probabilistic common knowledge if inf(x) > 0 for every string x ∈ {1, ..., n}*.

We note that there is no probabilistic common knowledge in the system iff there is some string x such that for some i, M_xi is a matrix with all rows equal and M_xi(·, t) = µ(P_i^t) for all t.

Theorem 4: If there is common knowledge in the system then there is probabilistic common knowledge, and for all x ∈ {1, ..., n}*,

    inf(x) ≥ H(P_∧).

Proof: We know from Fact 9 that M_ixi = diag(M_1, ..., M_r), where M_l (for l ≤ r) is the matrix corresponding to the transitions within one equivalence class of ≈_x, and all submatrices M_l are square and regular. Here r is the number of elements of P_∧. Suppose that the probabilities of the sets in the partition P_i are u_1, ..., u_k and that the probabilities of the elements of the partition P_∧ are w_1, ..., w_r. Each w_j is going to be the sum of those u_l where the lth set in the partition P_i is a subset of the jth set in the partition P_∧. Let m_l be the lth row of the matrix M_ixi. Then inf(ixi) is Σ_{l=1}^{k} u_l IG(m_l, u). The row m_l consists of zeroes, except in the places corresponding to subsets of the appropriate element P_∧^j of P_∧. Then, by Theorem 2, part (a):

    IG(m_l, u) ≥ −log(1 − (1 − w_j)) = −log w_j.

This quantity may repeat, since several elements of P_i may be contained in P_∧^j. When we add up all the multipliers u_l that occur with −log w_j, these multipliers also add up to w_j. Thus we get

    inf(ixi) ≥ −Σ_{j=1}^{r} w_j log(w_j) = H(P_∧).

We can also show:

Theorem 5: If x contains i, j and there is common knowledge between i, j and all the components of x, then the limiting information always exists and

    lim_{n→∞} inf(i(jxj)^n) = H(P_∧).

We postpone the proof to the full paper.
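The bound of Theorem 4 can be checked on the small common-knowledge example used after the Observation in Section 6 (again a sketch with our own names): there P_∧ = {{1,2}, {3,4}} with probabilities (0.5, 0.5), so H(P_∧) = 1 bit, and inf(iji) indeed comes out at least 1.

```python
from math import log2

def IG(u, y):
    return sum(ui * (log2(ui) - log2(yi)) for ui, yi in zip(u, y) if ui > 0)

# M_iji for the example with P_i = {1},{2},{3,4} and P_j = {1,2},{3},{4}, all worlds equally likely.
M_iji = [[0.5, 0.5, 0.0],
         [0.5, 0.5, 0.0],
         [0.0, 0.0, 1.0]]
u = [0.25, 0.25, 0.5]          # mu of i's cells
w = [0.5, 0.5]                 # mu of the two elements of the meet P_meet

inf_iji = sum(ul * IG(row, u) for ul, row in zip(u, M_iji))
H_meet = -sum(wj * log2(wj) for wj in w)
print(inf_iji, H_meet)         # 1.0 1.0  -- the bound inf(ixi) >= H(P_meet) holds, here with equality
```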

References

[Ab] Abramson, N., Information Theory and Coding, McGraw-Hill, 1963.

[AH] Abadi, M. and Halpern, J., Decidability and expressiveness for first-order logics of probability, Proc. of the 30th Annual Conference on Foundations of Computer Science, 1989.

[Au] Aumann, R., Agreeing to Disagree, Annals of Statistics 4, 1976.

[Ba] Bacchus, F., On Probability Distributions over Possible Worlds, Proceedings of the 4th Workshop on Uncertainty in AI, 1988.

[CM] Clark, H. H. and Marshall, C. R., Definite Reference and Mutual Knowledge, in Elements of Discourse Understanding, ed. Joshi, Webber and Sag, Cambridge University Press.

[Dr] Dretske, F., Knowledge and the Flow of Information, MIT Press.

[Ha] Halpern, J., An analysis of first-order logics of probability, Proc. of the 11th International Joint Conference on Artificial Intelligence (IJCAI-89), 1989.

[HM] Halpern, J. and Moses, Y., Knowledge and Common Knowledge in a Distributed Environment, Proc. 3rd ACM Conf. on Principles of Distributed Computing, 1984.

[KS] Kemeny, J. and Snell, L., Finite Markov Chains, Van Nostrand, 1960.

[Pa] Parikh, R., Levels of Knowledge in Distributed Computing, Proc. IEEE Symposium on Logic in Computer Science, 1986.

[Pa2] Parikh, R., A Utility Based Approach to Vague Predicates, to appear.

[PK] Parikh, R. and Krasucki, P., Levels of Knowledge in Distributed Computing, Research report, Brooklyn College, CUNY, 1986. Revised version of [Pa] above.

[Sh] Shannon, C., A Mathematical Theory of Communication, Bell System Technical Journal 27, 1948. (Reprinted in: Shannon and Weaver, A Mathematical Theory of Communication, University of Illinois Press, 1964.)
