ST213 Mathematics of Random Events

Wilfrid S. Kendall, version April

1. Introduction

The main purpose of the course ST213 Mathematics of Random Events (which we will abbreviate to MoRE) is to work over again the basics of the mathematics of uncertainty. You have already covered this in a rough-and-ready fashion in: (a) ST111 Probability; and possibly in (b) ST114 Games and Decisions. In this course we will cover these matters with more care.

It is important to do this because a proper appreciation of the fundamentals of the mathematics of random events (a) gives an essential basis for getting a good grip on the basic ideas of statistics; and (b) will be of increasing importance in the future, as it forms the basis of the hugely important field of mathematical finance.

It is appropriate at this level that we cover the material emphasizing concepts rather than proofs: by and large we will concentrate on what the results say, and so will on some occasions explain them rather than prove them. The third-year courses MA305 Measure Theory and ST318 Probability Theory go into the matter of proofs. For further discussion of how Warwick probability courses fit together, see our road-map to probability at Warwick at:

1.1 Books

[1] D. Williams (1991) Probability with Martingales. CUP.

1.2 Resources (including examination information)

The course is composed of 30 lectures, valued at 12 CATS credits. It has an assessed component (20%) as well as an examination in the summer term. The assessed component will be conducted as follows: an exercise sheet will be handed out approximately every fortnight, totalling 4 sheets. In the 10 minutes at the start of the next lecture you produce an answer to one question, under examination conditions, specified at the start of the lecture. Model answers will be distributed after the test, and an examples class will be held a week after the test.

The tests will be marked, and the assessed component will be based on the best 3 out of 4 of your answers. This method helps you learn during the lecture course, and so should: improve your exam marks; increase your enjoyment of the course; and cost less time than end-of-term assessment.

Further copies of exercise sheets (after they have been handed out in lectures!) can be obtained at the homepage for the ST213 course:

These notes will also be made available at the above URL, chapter by chapter as they are covered in lectures. Notice that they do not cover all the material of the lectures: their purpose is to provide a basic skeleton of summary material to supplement the notes you make during lectures. For example, no proofs are included. In particular you will not find it possible to cover the course by ignoring lectures and depending on these notes alone! Further related material (e.g. related courses, some pretty pictures of random processes, ...) can be obtained by following links from W.S. Kendall's homepage:

Finally, the Library Student Reserve Collection (SRC) will in the summer term hold copies of previous examination papers, and we will run two revision classes for this course at that time.

1.3 Motivating Examples

Here are some examples to help us see what the issues are.

(1) J. Bernoulli (circa 1692): Suppose that A_1, A_2, ... are mutually independent events, each of which has probability p. Define

    S_n = #{ events A_k which happen, for k ≤ n }.

Then the probability that S_n/n is close to p increases to 1 as n tends to infinity:

    P[ |S_n/n − p| ≤ ε ] → 1  as n → ∞,  for all ε > 0.
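As a quick illustration of Bernoulli's statement, here is a simulation sketch (the choices p = 0.3, ε = 0.05 and the sample sizes are arbitrary):

    import random

    def proportion_within(n, p, eps, trials=2000):
        """Estimate P[|S_n/n - p| <= eps] by simulating `trials` runs of n Bernoulli(p) events."""
        good = 0
        for _ in range(trials):
            s = sum(1 for _ in range(n) if random.random() < p)
            if abs(s / n - p) <= eps:
                good += 1
        return good / trials

    p, eps = 0.3, 0.05
    for n in (10, 100, 1000):
        print(n, proportion_within(n, p, eps))
    # The estimated probabilities increase towards 1 as n grows.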

(2) Suppose the random variable U is uniformly distributed over the continuous range [0, 1]. Why is it that for all x in [0, 1] we have P[U = x] = 0, and yet P[a ≤ U ≤ b] = b − a whenever 0 ≤ a ≤ b ≤ 1? Why can't we argue as follows?

    P[ a ≤ U ≤ b ] = P[ ⋃_{x ∈ [a,b]} {x} ] = Σ_{x ∈ [a,b]} P[ U = x ] = 0 ?

(3) The Banach-Tarski paradox. Consider a sphere S². In a certain qualified sense it is possible to do the following curious thing: we can find a subset F ⊆ S² and (for any k ≥ 3) rotations τ^k_1, τ^k_2, ..., τ^k_k such that

    S² = τ^k_1 F ∪ τ^k_2 F ∪ ... ∪ τ^k_k F.

What then should we suppose the surface area of F to be? Since S² = τ³_1 F ∪ τ³_2 F ∪ τ³_3 F we can argue for area(F) = 1/3. But since S² = τ⁴_1 F ∪ τ⁴_2 F ∪ τ⁴_3 F ∪ τ⁴_4 F we can equally argue for area(F) = 1/4. Or similarly for area(F) = 1/5. Or 1/6, or ...

(4) Reverting to Bernoulli's example (Example 1 above) we could ask: what is the probability that, when we look at the whole sequence S_1/1, S_2/2, S_3/3, ..., we see the sequence tends to p? Is this different from Bernoulli's statement?

(5) Here is a question which is apparently quite different, but which turns out to be strongly related to the above ideas! Can we generalize the idea of a Riemann integral in such a way as to make sense of rather discontinuous integrands, such as the case given below?

    ∫₀¹ f(x) dx   where   f(x) = 1 when x is a rational number, and 0 when x is an irrational number.

2. Probabilities, algebras, and σ-algebras

2.1 Motivation

Consider two coins A and B which are tossed in the air so as each to land with either heads or tails upwards. We do not assume the coin-tosses are independent!

It is often the case that one feels justified in assuming the coins individually are equally likely to come up heads or tails. Using the fact P[A = T] = 1 − P[A = H], etc., we find

    P[ A comes up heads ] = 1/2,    P[ B comes up heads ] = 1/2.

To find probabilities such as P[HH] = P[A = H, B = H] we need to say something about the relationship between the two coin-tosses. It is often the case that one feels justified in assuming the coin-tosses are independent, so

    P[ A = H, B = H ] = P[ A = H ] P[ B = H ].

However this assumption may be unwise when the person tossing the coin is not experienced! We may decide that some variant of the following is a better model: the event determining [B = H] is C if [A = H], and D if [A = T], where

    P[ C = H ] = 3/4,    P[ D = H ] = 1/4,

and A, C, D are independent.

There are two stages of specification at work here. Given a collection C of events, and specified probabilities P[C] for each C ∈ C, we can find P[C^c] = 1 − P[C], the probability of the complement C^c of C, but not necessarily P[C ∩ D] for C, D ∈ C.

2.2 Revision of sample space and events

Remember from ST111 that we can use notation from set theory to describe events. We can think of events as subsets of the sample space Ω. If A is an event, then the event that A does not happen is the complement or complementary event

    A^c = {ω ∈ Ω : ω ∉ A}.

If B is another event then the event that both A and B happen is the intersection

    A ∩ B = {ω ∈ Ω : ω ∈ A and ω ∈ B}.

The event that either A or B (or both!) happen is the union

    A ∪ B = {ω ∈ Ω : ω ∈ A or ω ∈ B}.

2.3 Algebras of sets

This leads us to identify classes of sets for which we want to find probabilities.

Definition 2.1 (Algebra of sets): An algebra (sometimes called a field) of subsets of Ω is a class C of subsets of a sample space Ω satisfying:

(1) closure under complements: if A ∈ C then A^c ∈ C;

(2) closure under intersections: if A, B ∈ C then A ∩ B ∈ C;

(3) closure under unions: if A, B ∈ C then A ∪ B ∈ C.

Definition 2.2 (Algebra generated by a collection): If C is a collection of subsets of Ω then A(C), the algebra generated by C, is the intersection of all algebras of subsets of Ω which contain C.

Here are some examples of algebras:

(i) the trivial algebra A = {Ω, ∅};

(ii) supposing Ω = {H, T}, another example is A = {Ω = {H, T}, {H}, {T}, ∅};

(iii) now consider the following class of subsets of the unit interval [0, 1]: A = { finite unions of subintervals }. This is an algebra. For example, if A = (a_0, a_1) ∪ (a_2, a_3) ∪ ... ∪ (a_{2n}, a_{2n+1}) is a non-overlapping union of intervals (and we can always re-arrange matters so that any union of intervals is non-overlapping!) then A^c = [0, a_0] ∪ [a_1, a_2] ∪ ... ∪ [a_{2n+1}, 1]. This checks point (1) of the definition of an algebra of sets. Point (2) is rather easy, and point (3) follows from points (1) and (2) (by De Morgan's laws);

(iv) consider A = {{1, 2, 3}, {1, 2}, {3}, ∅}. This is an algebra of subsets of Ω = {1, 2, 3}. Notice it does not include events such as {1}, {2, 3};

(v) just to give an example of a collection of sets which is not an algebra, consider {{1, 2, 3}, {1, 2}, {2, 3}, ∅};

(vi) algebras get very large. It is typically more convenient simply to give a collection C of sets generating the algebra. For example, if C = ∅ then A(C) = {∅, Ω} is the trivial algebra described above!

(vii) if Ω = {H, T} and C = {{H}} then A(C) = {{H, T}, {H}, {T}, ∅} as in example (ii) above;

(viii) if Ω = [0, 1] and C = { intervals in [0, 1] } then A(C) is the collection of finite unions of intervals as in example (iii) above;

(ix) finally, if Ω = [0, 1] and C is the collection of single-point sets {x} for x in [0, 1], then A(C) is the collection of (a) all finite sets in [0, 1] and (b) all complements of finite sets in [0, 1].
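For a finite Ω one can compute A(C) quite mechanically, by closing C under complements, unions and intersections until nothing new appears. A small sketch of this idea (illustrative only; the sets chosen reproduce example (iv)):

    from itertools import combinations

    def generated_algebra(omega, collection):
        """Close `collection` under complement, union and intersection within `omega`."""
        omega = frozenset(omega)
        algebra = {frozenset(), omega} | {frozenset(c) for c in collection}
        changed = True
        while changed:
            changed = False
            current = list(algebra)
            new_sets = {omega - a for a in current}
            new_sets |= {a | b for a, b in combinations(current, 2)}
            new_sets |= {a & b for a, b in combinations(current, 2)}
            if not new_sets <= algebra:
                algebra |= new_sets
                changed = True
        return algebra

    print(sorted(map(sorted, generated_algebra({1, 2, 3}, [{1, 2}]))))
    # prints the four sets ∅, {1,2}, {3}, {1,2,3}: the algebra of example (iv)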

In realistic examples algebras are rather "large": not surprising, since they correspond to the collection of all true-or-false statements you can make about a certain experiment! (If your experiment's results can be summarised as n different yes/no answers, such as "result is hot/cold", "result is coloured black/white", etc., then the relevant algebra is composed of 2^n different subsets!) Therefore it is of interest that the typical element of an algebra can be written down in a rather special form.

Theorem 2.3 (Representation of typical element of algebra): If C is a collection of subsets of Ω then the event A belongs to the algebra A(C) generated by C if and only if

    A = ⋃_{i=1}^{N} ⋂_{j=1}^{M_i} C_{i,j}

where for each i, j either C_{i,j} or its complement C_{i,j}^c belongs to C. Moreover we may write A in this form with the sets

    D_i = ⋂_{j=1}^{M_i} C_{i,j}

being disjoint.

* This result corresponds to a basic remark in logic: logical statements, however complicated, can be reduced to statements of the form (A_1 and A_2 and ... and A_m) or (B_1 and B_2 and ... and B_n) or ... or (C_1 and C_2 and ... and C_p), where the statements A_1 etc. are either basic statements or their negations, and no more than one of the (...) or ... or (...) can be true at once.

We are now in a position to produce our first stab at a set of axioms for probability. Given a sample space and an algebra A of subsets, probability P[·] assigns a number between 0 and 1 to each event in the algebra A, obeying the rules given below. There is a close analogy to the notion of "length" of subsets of [0, 1] (and also to notions of area, volume, ...): the table below makes this clear.

    Probability                                Length of subset of [0, 1]
    P[∅] = 0                                   Length(∅) = 0
    P[Ω] = 1                                   Length([0, 1]) = 1
    P[A ∪ B] = P[A] + P[B] if A ∩ B = ∅        Length([a, b] ∪ [c, d]) = Length([a, b]) + Length([c, d]) if a ≤ b < c ≤ d

There are some consequences of these axioms which are not completely trivial. For example, the law of negation

    P[ A^c ] = 1 − P[ A ];

the generalized law of addition, holding when A ∩ B is not necessarily empty,

    P[ A ∪ B ] = P[ A ] + P[ B ] − P[ A ∩ B ]

(think of "double-counting"); and finally the inclusion-exclusion law

    P[ A_1 ∪ A_2 ∪ ... ∪ A_n ] = Σ_i P[ A_i ] − Σ_{i<j} P[ A_i ∩ A_j ] + ... + (−1)^{n+1} P[ A_1 ∩ A_2 ∩ ... ∩ A_n ].

2.4 Limit Sets

Much of the first half of ST111 is concerned with calculations using these various rules of probabilistic calculation. Essentially the representation theorem above tells us we can compute the probability of any event in A(C) just so long as we know the probabilities of the various events in C and also of all their intersections, whether by knowing events are independent or whether by knowing various conditional probabilities.*

* We avoid discussing conditional probabilities here for reasons of shortage of time: they have been dealt with in ST111 and figure very largely in

However these calculations can become long-winded and ultimately either infeasible or unrevealing. It is better to know how to approximate probabilities and events, which leads us to the following kind of question. Suppose we have a sequence of events C_n which are decreasing (getting harder and harder to satisfy) and which converge to a limit C: C_n ↓ C. Can we say P[C_n] converges to P[C]?

Here is a specific example. Suppose we observe an infinite sequence of coin tosses, and think therefore of the collection C of events A_i that the i-th coin comes up heads. Consider the probabilities:

(a) P[ second toss gives heads ] = P[ A_2 ];

(b) P[ first n tosses all give heads ] = P[ ⋂_{i=1}^n A_i ];

(c) P[ the first toss which gives a head is even-numbered ].

There is a difference! The first two can be dealt with within the algebra. The third cannot: suppose C_n is the event "the first toss in numbers 1, ..., n which gives a head is even-numbered, or else all n of these tosses give tails". Then C_n lies in A(C), and converges down to the event C = "the first toss which gives a head is even-numbered", but C is not in A(C).

We now find a number of problems raise their heads.

Problems with "everywhere being impossible": Suppose we are running an experiment with an outcome uniformly distributed over [0, 1]. Then we have a problem as mentioned in the second of our motivating examples: under reasonable conditions we are working with the algebra of finite unions of sub-intervals of [0, 1], and the probability measure which gives P[[a, b]] = b − a, but this means P[{a}] = 0. Now we need to be careful, since if we rashly allow ourselves to work with uncountable unions we get

    P[ ⋃_{x ∈ [0,1]} {x} ] = Σ_{x ∈ [0,1]} 0 = 0.

But this contradicts P[[0, 1]] = 1 and so is obviously wrong.

Problems with specification: if we react to the above example by insisting we can only give probabilities to events in the original algebra, then we can fail to give probabilities to perfectly sensible events, such as (c) in the infinite sequence of coin-tosses above. On the other hand, if we rashly prescribe probabilities then how can we avoid getting into contradictions such as the above?

It seems sensible to suppose that at least when we have C_n ↓ C then we should be allowed to say P[C_n] → P[C], and this turns out to be the case as long as the set-up is sensible. Here is an example of a set-up which is not sensible: Ω = {1, 2, 3, ...}, C = {{1}, {2}, ...}, P[{n}] = 1/2^{n+1}. Then A(C) is the collection of finite and co-finite subsets of the positive integers (co-finite: the complement is finite), and

    P[ {1, 2, ..., n} ] = Σ_{m=1}^{n} 1/2^{m+1} = 1/2 − 1/2^{n+1} → 1/2 ≠ 1,

so although {1, 2, ..., n} increases up to Ω we do not have P[{1, 2, ..., n}] → P[Ω] = 1.

We must now investigate how we can deal with limit sets.
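To make example (c) concrete: if we additionally assume the tosses are fair and independent (an assumption made only for this illustration), P[C_n] can be computed exactly and seen to settle down, suggesting the value the limit event C ought to receive. A small sketch:

    def p_C_n(n):
        """P[C_n] for fair independent tosses: first head among tosses 1..n is
        even-numbered, or all n tosses are tails."""
        even_first_head = sum(0.5 ** k for k in range(2, n + 1, 2))
        all_tails = 0.5 ** n
        return even_first_head + all_tails

    for n in (1, 2, 5, 10, 20):
        print(n, p_C_n(n))
    # The values decrease towards 1/3, the natural candidate for P[C].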

2.5 σ-algebras

The first task is to establish a wide range of sensible limit sets. Boldly, we look at sets which can be obtained by any imaginable combination of countable set operations: the collection of all such sets is a σ-algebra (σ stands for "countable").

Definition 2.4 (σ-algebra): A σ-algebra of subsets of Ω is an algebra which is also closed under countable unions.

In fact σ-algebras are even larger than ordinary algebras; it is difficult to describe a typical member of a σ-algebra, and it pays to talk about σ-algebras generated by specified collections of sets.

Definition 2.5 (σ-algebra generated by a collection): For any collection of subsets C of Ω, we define σ(C) to be the intersection of all σ-algebras of subsets of Ω which contain C:

    σ(C) = ⋂ {S : S is a σ-algebra and C ⊆ S}.

Theorem 2.6 (Monotone limits): Note that σ(C) defined above is indeed a σ-algebra. Furthermore, it is the smallest σ-algebra containing C which is closed under monotone limits.

Examples of σ-algebras include: all algebras of subsets of finite sets (because then there will be no non-finite countable set operations); the Borel σ-algebra generated by the family of all intervals of the real line; the σ-algebra for the coin-tossing example generated by the infinite family of events A_i = [ i-th coin is heads ].

2.6 Countable additivity

Now we have established a context for limit sets (they are sets belonging to a σ-algebra), we can think about what sort of limiting operations we should allow for probability measures.

Definition 2.7 (Measures): A set-function µ : A → [0, ∞] is said to be a finitely-additive measure if it satisfies:

(FA) µ(A ∪ B) = µA + µB whenever A, B are disjoint.

It is said to be countably-additive (or σ-additive) if in addition

(CA) µ(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ µA_i whenever the A_i are disjoint and their union ⋃_{i=1}^∞ A_i lies in A.

We abbreviate "finitely-additive" to (FA) and "countably-additive" to (CA). We often abbreviate "countably-additive measure" to "measure". Notice that if A were actually a σ-algebra then we wouldn't have to check the condition "⋃_{i=1}^∞ A_i lies in A" in condition (CA).

Definition 2.8 (Probability measures): A set-function P : A → [0, 1] is said to be a finitely-additive probability measure if it is a (FA) measure such that P[Ω] = 1. It is a (CA) probability measure (we often just say probability measure) if in addition it is (CA).

Notice various consequences for probability measures: µ(∅) = 0; finite additivity (FA) follows from countable additivity (CA); if (CA) holds, we always have µ(⋃_{i=1}^∞ A_i) ≤ Σ_{i=1}^∞ µ(A_i) even when the union is not disjoint; etc.

(CA) is a kind of continuity condition. A similar continuity condition is that of monotone limits.

Definition 2.9 (Monotone limits): A set-function µ : A → [0, 1] is said to obey the monotone limits property (ML) if it satisfies: µA_i ↑ µA whenever the A_i increase upwards to a limit set A which lies in A.

(ML) is simpler to check than (CA) but is equivalent for finitely-additive measures.

Theorem 2.10 (Equivalence for countable additivity): (FA) + (ML) ⟺ (CA).

Lemma 2.11 (Another equivalence): Suppose P is a finitely additive probability measure on (Ω, F), where F is an algebra of sets. Then P is countably additive if and only if lim_{n→∞} P[A_n] = 1 whenever the sequence of events A_n belongs to the algebra F and moreover A_n ↑ Ω.

2.7 Uniqueness of probability measures

To illustrate the next step, consider the notion of length/area. (To avoid awkward alternatives, we talk about the "measure" instead of length/area/volume/...) It is easy to define the area of very regular sets. But for a stranger, more fractal-like, set A we would need to define something like an outer-measure

    µ*(A) = inf { Σ_i µ(B_i) : the B_i cover A }

to get at least an upper bound for what it would be sensible to call the measure of A. Of course we must give equal priority to considering what is the measure of the complement A^c.

Suppose for definiteness that A is contained in a simple set Q of finite measure (a convenient interval for length, a square for area, a cube for volume, ...), so that A^c = Q \ A. Then consideration of µ*(A^c) leads us directly to consideration of inner-measure for A:

    µ_*(A) = µ(Q) − µ*(A^c).

Clearly µ_*(A) ≤ µ*(A); moreover we can only expect a truly sensible definition of measure on the set

    F = { A : µ_*(A) = µ*(A) }.

The fundamental theorem of measure theory states that this works out all right!

Theorem 2.12 (Extension theorem): If µ is a measure on an algebra A which is σ-additive on A then it can be extended uniquely to a countably additive measure on F defined as above: moreover σ(A) ⊆ F.

The proof of this remarkable theorem is too lengthy to go into here. Notice that it can be paraphrased very simply: if your notion of measure (probability, length, area, volume, ...) can be defined consistently on an algebra in such a way that it is σ-additive whenever the two sides of

    µ( ⋃_{i=1}^∞ A_i ) = Σ_{i=1}^∞ µA_i

make sense (whenever the disjoint union ⋃_{i=1}^∞ A_i actually belongs to the algebra), then it can be extended uniquely to the (typically much larger) σ-algebra generated by the original algebra, so as again to be a (σ-additive) measure.

There is an important special part of this theorem which is worth stating separately.

Definition 2.13 (Π-system): A Π-system of subsets of Ω is a collection of subsets including Ω itself and closed under finite intersections.

Theorem 2.14 (Uniqueness for probability measures): Two finite measures which agree on a Π-system Π also agree on the generated σ-algebra σ(Π).

2.8 Lebesgue measure and coin tossing

The extension theorem can be applied to the uniform probability space: Ω = [0, 1], A given by finite unions of intervals, P given by lengths of intervals. It turns out P is indeed σ-additive on A (showing this is non-trivial!) and so the extension theorem tells us there is a unique countably additive extension P on the σ-algebra B = σ(A) (the Borel σ-algebra restricted to [0, 1]). We call this Lebesgue measure.

There is a significant connection between infinite sequences of coin tosses and numbers in [0, 1]. Briefly, we can expand a number x ∈ [0, 1] in binary (as opposed to decimal!): we write x as .ω_1 ω_2 ω_3 ..., where ω_i equals 1 or 0 according as 2^i x (mod 2) is greater than or equal to 1 or not. The coin-tossing σ-algebra can be viewed as generated by the sequence {ω_1, ω_2, ω_3, ...}, with 0 standing for tails and 1 for heads. In effect we get a map from coin-tossing space 2^N to number space [0, 1], with the slight cautionary note that this map very occasionally maps two sequences onto one number (think of 0.0111... and 0.1000..., which both represent one half). In particular

    [ω_1 = a_1, ω_2 = a_2, ..., ω_d = a_d] = [x, x + 2^{−d})

where x is the number corresponding to (a_1, a_2, ..., a_d). Remarkably, we can now use the uniqueness theorem to show that the map T : (a_1, a_2, ..., a_d) ↦ x preserves probabilities, in the sense that Lebesgue measure is exactly the same as we get by finding the probability of the event T^{−1}(A) as a coin-tossing event, if the coins are independent and fair.
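A quick numerical illustration of this correspondence (a sketch only; 20 digits and 100,000 samples are arbitrary choices): mapping fair independent coin tosses to a number in [0, 1] produces something indistinguishable from a uniform random variable.

    import random

    def toss_to_number(d=20):
        """Map d fair coin tosses (1 = heads) to the corresponding dyadic x in [0, 1)."""
        return sum(random.randint(0, 1) * 2 ** -(i + 1) for i in range(d))

    samples = [toss_to_number() for _ in range(100_000)]
    a, b = 0.25, 0.7
    frequency = sum(1 for x in samples if a <= x <= b) / len(samples)
    print(frequency, b - a)  # the empirical frequency should be close to b - a = 0.45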

It is reasonable to ask whether there are any non-measurable sets, since σ-algebras are so big! It is indeed very hard to find any. Here is the basic example, which is due in essence to Vitali. Consider the following equivalence relation on (Ω, B, P) with Ω = [0, 1]: we say x ∼ y if x − y is a rational number. Now construct a set A by choosing exactly one member from each equivalence class. So for any x ∈ [0, 1] there is one and only one y ∈ A such that x − y is a rational number. If A were Lebesgue measurable then it would have a value P[A]. What would this value be?

Imagine [0, 1] folded round into a circle. It is the case that P[A] does not change when one turns this circle. In particular we can now consider A_q = {a + q : a ∈ A} (addition taken modulo 1, i.e. round the circle) for rational q. By construction A_q and A_r are disjoint for different rational q, r. Now we have

    ⋃_{q rational} A_q = [0, 1],

and since there are only countably many rational q, and P[A_q] doesn't depend on q, we determine

    P[ [0, 1] ] = Σ_{q rational} P[ A_q ] = Σ_{q rational} P[ A ].

But this cannot make sense if P[[0, 1]] = 1: the right-hand side is either 0 or ∞, according as P[A] = 0 or P[A] > 0. We are forced to conclude that A cannot be Lebesgue measurable. This example has a lot to do with the Banach-Tarski paradox described in one of our motivating examples above.

3. Independence and measurable functions

3.1 Independence

In ST111 we formalized the idea of independence of events. Essentially we require a multiplication law to hold.

Definition 3.15 (Independence of an infinite sequence of events): We say the events A_i (for i = 1, 2, ...) are independent if, for any finite subsequence i_1 < i_2 < ... < i_k, we have

    P[ A_{i_1} ∩ ... ∩ A_{i_k} ] = P[ A_{i_1} ] ... P[ A_{i_k} ].

Notice we require all possible multiplication laws to hold: it is possible to build interesting examples where events are independent pair-by-pair, but altogether give non-trivial information about each other.

We need to talk about infinite sequences of events (often independent). We often have in the back of our minds a sense that the sequence is revealed to us progressively over time (though this need not be so!), suggesting two natural questions. First, will we see events occur in the sequence right into the indefinite future? Second, will we after some point see all events occur?

Definition 3.16 ("Infinitely often" and "Eventually"): Given a sequence of events B_1, B_2, ..., we say B_i holds infinitely often ([B_i i.o.]) if there are infinitely many different i for which the statement B_i is true: in set-theoretic terms

    [B_i i.o.] = ⋂_{i=1}^∞ ⋃_{j=i}^∞ B_j.

We say B_i holds eventually ([B_i ev.]) if for all large enough i the statement B_i is true: in set-theoretic terms

    [B_i ev.] = ⋃_{i=1}^∞ ⋂_{j=i}^∞ B_j.

Notice these two concepts "ev." and "i.o." make sense even if the infinite sequence is just a sequence, with no notion of events occurring consecutively in time! Notice also (you should check this yourself!)

    [B_i i.o.] = ( [B_i^c ev.] )^c.

3.2 Borel-Cantelli lemmas

The multiplication laws appearing above in Section 2.1 force a kind of infinite multiplication law.

Lemma 3.17 (Probability of infinite intersection): If the events A_i (for i = 1, 2, ...) are independent then

    P[ ⋂_{i=1}^∞ A_i ] = ∏_{i=1}^∞ P[ A_i ].

We have to be careful what we mean by the infinite product ∏_{i=1}^∞ P[A_i]: we mean of course the limiting value lim_{n→∞} ∏_{i=1}^n P[A_i].

We can now prove a remarkable pair of facts about P[A_i i.o.] (and hence its twin P[A_i ev.]!). It turns out it is often easy to tell whether these events have probability 0 or 1.

Theorem 3.18 (Borel-Cantelli lemmas): Suppose the events A_i (for i = 1, 2, ...) form an infinite sequence. Then

(i) if Σ_{i=1}^∞ P[A_i] < ∞ then P[ A_i holds infinitely often ] = P[ A_i i.o. ] = 0;

(ii) if Σ_{i=1}^∞ P[A_i] = ∞ and the A_i are independent then P[ A_i holds infinitely often ] = P[ A_i i.o. ] = 1.

Note the two parts of the above result are not quite symmetrical: the second part also requires independence. It is a good exercise to work out a counterexample to part (ii) if independence fails.
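A simulation sketch contrasting the two cases (the choices P[A_i] = 1/i² versus 1/i, and the horizon of 10,000 events, are arbitrary): with summable probabilities only a few events occur, while in the non-summable independent case occurrences keep on appearing.

    import random

    def occurrences(prob, n=10_000):
        """Indices i <= n at which an independent event with P[A_i] = prob(i) occurs."""
        return [i for i in range(1, n + 1) if random.random() < prob(i)]

    summable = occurrences(lambda i: 1.0 / i ** 2)      # sum of probabilities is finite
    non_summable = occurrences(lambda i: 1.0 / i)       # sum of probabilities diverges

    print(len(summable), max(summable))          # few occurrences, typically all early on
    print(len(non_summable), max(non_summable))  # many occurrences, some near the horizon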

3.3 Law of large numbers for events

As a consequence of these ideas it can be shown that limiting frequencies exist for sequences of independent trials with the same success probability.

Theorem 3.19 (Law of large numbers for events): Suppose that we have a sequence of independent events A_i, each with the same probability p. Let S_n count the number of the events A_1, ..., A_n which occur. Then for all positive ε

    P[ |S_n/n − p| ≤ ε ev. ] = 1.

3.4 Independence and classes of events

The idea of independence stretches beyond mere sequences of events. For example, consider (a) a set of events concerning a football match between Coventry City and Aston Villa at home for Coventry, and (b) a set of events concerning a cricket test between England and Australia at Melbourne, both happening on the same day. At least as a first approximation, one might assume that any combination of events concerning (a) is independent of any combination concerning (b).

Definition 3.20 (Independence and classes of events): Suppose C_1, C_2 are two classes of events. We say they are independent if A and B are independent whenever A ∈ C_1, B ∈ C_2.

Here our notion of Π-systems becomes important.

Lemma 3.21 (Independence and Π-systems): If two Π-systems are independent, then so are the σ-algebras they generate.

Returning to sequences, the above is the reason why we can jump immediately from assumptions of independence of events to deducing that their complements are independent.

Corollary 3.22 (Independence and complements): If a sequence of events A_i is independent, then so is the sequence of complementary events A_i^c.

3.5 Measurable functions

Mathematical work often becomes easier if one moves from sets to functions. Probability theory is no different. Instead of events (subsets of sample space) we can often find it easier to work with random variables (real-valued functions defined on sample space). You should think of a random variable as involving lots of different events, namely those events defined in terms of the random variable taking on different sets of values. Accordingly we need to take care that the random variable doesn't produce events which fall outwith our chosen σ-algebra. To do this we need to develop the idea of a measurable function.

Definition 3.23 (Measurable space): (Ω, F) is a measurable space if F is a σ-algebra of subsets of Ω.

Definition 3.24 (Borel σ-algebra): The Borel σ-algebra B is the σ-algebra of subsets of R generated by the collection of intervals of R.

In fact we don't need all the intervals of R. It is enough to take the closed half-infinite intervals (−∞, x].

Definition 3.25 (Measurable function): Suppose that (Ω, F) and (Ω′, F′) are both measurable spaces. We say the function f : Ω → Ω′ is measurable if f^{−1}(A) = {ω : f(ω) ∈ A} belongs to F whenever A belongs to F′.

Definition 3.26 (Random variable): Suppose that X : Ω → R is measurable as a mapping from (Ω, F) to (R, B). Then we say X is a random variable.

As we have said, to each random variable there is a class of related events. This actually forms a σ-algebra.

Definition 3.27 (σ-algebra generated by a random variable): If X : Ω → R is a random variable then the σ-algebra generated by X is the family of events

    σ(X) = {X^{−1}(A) : A ∈ B}.

3.6 Independence of random variables

Random variables can be independent too! Essentially here independence means that an event generated by one of the random variables cannot be used to give useful predictions about an event generated by the other random variable.

Definition 3.28 (Independence of random variables): We say random variables X and Y are independent if their σ-algebras σ(X), σ(Y) are independent.

Theorem 3.29 (Criterion for independence of random variables): Let X and Y be random variables, and let P be the Π-system of R formed by all half-infinite closed intervals (−∞, x]. Then X and Y are independent if and only if the collections of events X^{−1}P, Y^{−1}P are independent.*

* Here we define X^{−1}P = {X^{−1}(A) : A ∈ P} = {X^{−1}((−∞, x]) : x ∈ R}.
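Theorem 3.29 says independence can be checked on half-lines, i.e. via P[X ≤ x, Y ≤ y] = P[X ≤ x] P[Y ≤ y]. Here is a Monte Carlo sanity check of this factorisation for two uniform random variables which are independent by construction (an illustrative sketch; the sample size and the test points are arbitrary):

    import random

    n = 100_000
    pairs = [(random.random(), random.random()) for _ in range(n)]  # independent by construction

    for x, y in [(0.3, 0.8), (0.5, 0.5)]:
        joint = sum(1 for u, v in pairs if u <= x and v <= y) / n
        product = (sum(1 for u, _ in pairs if u <= x) / n) * (sum(1 for _, v in pairs if v <= y) / n)
        print(joint, product)  # the two estimates should agree up to sampling error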

3.7 Distributions of random variables

We often need to talk about random variables on their own, without reference to other random variables or events. In such cases all we are interested in is the probabilities they have of taking values in various regions.

Definition 3.30 (Distribution of a random variable): Suppose that X is a random variable. Its distribution is the probability measure P_X on R given by

    P_X[B] = P[ X ∈ B ]   whenever B ∈ B.

4. Integration

One of the main things to do with functions is to integrate them (find the area under the curve). One of the main things to do with random variables is to take their expectations (find their average values). It turns out that these are really the same idea! We start with integration.

4.1 Simple functions and indicators

Begin by thinking of the simplest possible function to integrate. That is an indicator function, which only takes two possible values, 0 or 1.

Definition 4.31 (Indicator function): If A is a measurable set then its indicator function is defined by

    I[A](x) = 0 if x ∉ A;  1 if x ∈ A.

The next stage up is to consider a simple function taking only a finite number of values, since it can be regarded as a linear combination of indicator functions.

Definition 4.32 (Simple functions): A simple function h is a measurable function h : Ω → R which only takes finitely many values. Thus we can represent it as

    h(x) = c_1 I[A_1](x) + ... + c_n I[A_n](x)

for some finite collection A_1, ..., A_n of measurable sets and constants c_1, ..., c_n.

It is easy to integrate simple functions...

Definition 4.33 (Integration of simple functions): The integral of a simple function h with respect to a measure µ is given by

    ∫ h dµ = ∫ h(x) µ(dx) = Σ_{i=1}^n c_i µ(A_i)

where h(x) = c_1 I[A_1](x) + ... + c_n I[A_n](x) as above.
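A tiny sketch of Definition 4.33 in code, representing a simple function on [0, 1] by its constants c_i and (disjoint) intervals A_i, with the Lebesgue measure of an interval being just its length (the particular h is an arbitrary illustrative choice):

    def integral_simple(terms):
        """Integral of h = sum of c * I[(a, b)] against Lebesgue measure on [0, 1].

        `terms` is a list of (c, (a, b)) with the intervals assumed disjoint.
        """
        return sum(c * (b - a) for c, (a, b) in terms)

    # h = 2 on [0, 1/2), 5 on [1/2, 3/4), 0 elsewhere
    h = [(2, (0.0, 0.5)), (5, (0.5, 0.75))]
    print(integral_simple(h))  # 2 * 0.5 + 5 * 0.25 = 2.25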

Note that one really should prove that the definition of ∫ h dµ does not depend on exactly how one represents h as the sum of indicator functions.

Integration for such functions has a number of basic properties which one uses all the time, almost unconsciously, when trying to find integrals.

Theorem 4.34 (Properties of integration for simple functions):

(1) if µ(f ≠ g) = 0 then ∫ f dµ = ∫ g dµ;

(2) Linearity: ∫ (af + bg) dµ = a ∫ f dµ + b ∫ g dµ;

(3) Monotonicity: f ≤ g means ∫ f dµ ≤ ∫ g dµ;

(4) min{f, g} and max{f, g} are simple.

Simple functions are rather boring. For more general functions we use limiting arguments. We have to be a little careful here, since some functions will have integrals built up from +∞ where they are integrated over one part of the region, and −∞ over another part. Think for example of

    ∫_{−1}^{1} (1/x) dx = ∫_{−1}^{0} (1/x) dx + ∫_{0}^{1} (1/x) dx,

which asks what "−∞ + ∞" should equal. So we first consider just non-negative functions.

Definition 4.35 (Integration for non-negative measurable functions): If f ≥ 0 is measurable then we define

    ∫ f dµ = sup { ∫ g dµ : for simple g such that 0 ≤ g ≤ f }.
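For reference, one standard way (not spelled out in these notes) of producing simple functions increasing up to a given non-negative measurable f, as used in Corollary 4.38 below, is the dyadic staircase construction:

    f_n(x) = min{ n, 2^{−n} ⌊ 2^n f(x) ⌋ }.

Each f_n is simple (it takes only the finitely many values k/2^n for 0 ≤ k ≤ n 2^n), we have 0 ≤ f_n ≤ f, and f_n ↑ f pointwise as n → ∞.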

4.2 Integrable functions

For general functions we require that we don't get into this situation of "∞ − ∞".

Definition 4.36 (Integration for general measurable functions): If f is measurable and we can write f = g − h for two non-negative measurable functions g and h, both with finite integrals, then

    ∫ f dµ = ∫ g dµ − ∫ h dµ.

We then say f is integrable.

One really needs to prove that the integral ∫ f dµ does not depend on the choice of decomposition f = g − h. In fact if there is any choice which works then the easy choice

    g = max{f, 0},    h = max{−f, 0}

will work. One can show that the integral on integrable functions agrees with its definition on simple functions, and is linear.

What starts to make the theory very easy is that the integral thus defined behaves very well when studying limits.

Theorem 4.37 (Monotone convergence theorem (MON)): If f_n ↑ f (all being non-negative measurable functions) then

    ∫ f_n dµ ↑ ∫ f dµ.

Corollary 4.38 (Integrability and simple functions): if f is non-negative and measurable then for any sequence of non-negative simple functions f_n such that f_n ↑ f we have ∫ f_n dµ ↑ ∫ f dµ.

Definition 4.39 (Integration over a measurable set): if A is measurable and f is integrable then

    ∫_A f dµ = ∫ (I[A] f) dµ.

4.3 Expectation of random variables

The above notions apply directly to random variables, which may be thought of simply as measurable functions defined on the sample space!

Definition 4.40 (Expectation): if P is a probability measure then we define expectation (with respect to this probability measure) for all integrable random variables X by

    E[X] = ∫ X dP = ∫ X(ω) P(dω).

The notion of expectation is really only to do with the random variable considered on its own, without reference to any other random variables. Accordingly it can be expressed in terms of the distribution of the random variable.

Theorem 4.41 (Change of variables): Let X be a random variable and let g : R → R be a measurable function. Assuming that the random variable g(X) is integrable,

    E[ g(X) ] = ∫_R g(x) P_X(dx).

4.4 Examples

You need to work through examples such as the following to get a good idea of how the above really works out in practice. See the material covered in lectures for more on this; worked answers to several of these follow the list.

(a) Evaluate ∫_0^1 x Leb(dx).

(b) Consider Ω = {1, 2, 3, ...}, with P[{i}] = p_i where Σ_{i=1}^∞ p_i = 1. Show that ∫ f dP = Σ_{i=1}^∞ f(i) p_i.

(c) Evaluate ∫_0^y e^x Leb(dx).

(d) Evaluate ∫_0^n f(x) Leb(dx), where f(x) = 1 if 0 ≤ x < 1, 2 if 1 ≤ x < 2, ..., n if n − 1 ≤ x < n.

(e) Evaluate ∫ I_{[0,θ]}(x) sin(x) Leb(dx).
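For checking your work, the integrals in (a), (c), (d) and (e) reduce to familiar calculus (taking θ ≥ 0 in (e)):

    ∫_0^1 x Leb(dx) = 1/2,
    ∫_0^y e^x Leb(dx) = e^y − 1,
    ∫_0^n f(x) Leb(dx) = 1 + 2 + ... + n = n(n + 1)/2,
    ∫ I_{[0,θ]}(x) sin(x) Leb(dx) = ∫_0^θ sin(x) dx = 1 − cos θ.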

5. Convergence

Approximation is a fundamental key to making mathematics work in practice. Instead of being stuck, unable to do a hard problem, we find an easier problem which has almost the same answer, and do that instead! The notion of convergence (see first-year analysis) is the formal structure giving us the tools to do this. For random variables there are a number of different notions of convergence, depending on whether we need to approximate a whole sequence of actual random values, or just a particular random value, or even just probabilities.

5.1 Convergence of random variables

Definition 5.42 (Convergence in probability): The random variables X_n converge in probability to Y, written X_n → Y in prob., if for all positive ε we have

    P[ |X_n − Y| > ε ] → 0.

Definition 5.43 (Convergence almost surely / almost everywhere): The random variables X_n converge almost surely to Y, written X_n → Y a.s., if we have

    P[ X_n does not converge to Y ] = 0.

The (measurable) functions f_n converge almost everywhere to f if the set

    {x : f_n(x) → f(x) fails}

is of Lebesgue measure zero.

The difference is that convergence in probability deals with just a single random value X_n for large n. Convergence almost surely deals with the behaviour of the whole sequence. Here are some examples to think about.

- Consider random variables defined on ([0, 1], B, Leb) by X_n(ω) = I_{[0, 1/n]}(ω). Then X_n → 0 a.s.

- Consider the probability space above and the events A_1 = [0, 1], A_2 = [0, 1/2], A_3 = [1/2, 1], A_4 = [0, 1/4], ..., A_7 = [3/4, 1], ... Then X_n = I_{[A_n]} converges to zero in probability but not almost surely (this is simulated in the sketch after this list).

- Suppose in the above that X_n = Σ_{k=1}^n (k/n) I_{[(k−1)/n, k/n]}. Then X_n → X a.s., where X(ω) = ω for ω ∈ [0, 1].

- Suppose in the above that X_n ≤ a for all n. Let Y_n = max_{m ≤ n} X_m. Then Y_n → Y a.s. for some Y.

- Suppose in the above that the X_n are not bounded, but are independent, and furthermore lim_{a→∞} ∏_{n=1}^∞ P[X_n ≤ a] = 1. Then Y_n → Y a.s., where P[Y ≤ a] = ∏_{n=1}^∞ P[X_n ≤ a].
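The second example above (the "sliding blocks" of dyadic intervals) is worth simulating to see the distinction; here is a sketch (the point ω = 0.3 and the depth of 6 dyadic levels are arbitrary choices):

    def blocks(levels):
        """The intervals [0,1], [0,1/2], [1/2,1], [0,1/4], ..., level by level."""
        out = []
        for m in range(levels):
            width = 2 ** -m
            out.extend((k * width, (k + 1) * width) for k in range(2 ** m))
        return out

    omega = 0.3
    for n, (a, b) in enumerate(blocks(6), start=1):
        x_n = 1 if a <= omega <= b else 0
        print(n, "P[X_n != 0] =", b - a, " X_n(0.3) =", x_n)
    # P[X_n != 0] = length of A_n -> 0 (convergence in probability),
    # yet X_n(0.3) keeps returning to 1 at every level (no convergence at omega = 0.3).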

As one might expect, the notion of almost sure convergence implies that of convergence in probability.

Theorem 5.44 (Almost sure convergence implies convergence in probability): X_n → X a.s. implies X_n → X in prob.

Almost sure convergence allows for various theorems telling us when it is OK to exchange integrals and limits. Generally this doesn't work: consider the example

    1 = ∫_0^∞ λ exp(−λt) dt  for every λ > 0,   but   ∫_0^∞ lim_{λ→∞} λ exp(−λt) dt = ∫_0^∞ 0 dt = 0

(the pointwise limit being 0 for every t > 0). However we have already seen one case where it does work: when the limit is monotonic. In fact we only need this to hold almost everywhere (i.e. when the convergence is almost sure).

Theorem 5.45 (MON): if the functions f_n, f are non-negative and if f_n ↑ f µ-a.e. then

    ∫ f_n dµ ↑ ∫ f dµ.

It is often the case that the following simple inequalities are crucial to figuring out whether convergence holds.

Lemma 5.46 (Markov's inequality): if f : R → R is increasing and non-negative and X is a random variable then

    P[ X ≥ a ] ≤ E[ f(X) ] / f(a)

for all a such that f(a) > 0.

Corollary 5.47 (Chebyshev's inequality): if E[X²] < ∞ then

    P[ |X − E[X]| ≥ a ] ≤ Var(X)/a²

for all a > 0.

In particular we can get a lot of mileage by combining this with the fact that, while in general variance is not additive, it is additive in the case of independence.

Lemma 5.48 (Variance and independence): if a sequence of random variables X_i is independent then

    Var( Σ_{i=1}^n X_i ) = Σ_{i=1}^n Var(X_i).
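A quick Monte Carlo sanity check of Chebyshev's inequality and of the additivity of variances under independence (a sketch; uniform summands and the particular n and a are arbitrary choices):

    import random

    n, trials = 30, 20_000
    sums = [sum(random.random() for _ in range(n)) for _ in range(trials)]  # S = X_1 + ... + X_n, X_i uniform on [0,1]

    mean = sum(sums) / trials
    var = sum((s - mean) ** 2 for s in sums) / trials
    print(var, n / 12)  # empirical Var(S) versus the additive prediction n * Var(X_i) = n/12

    a = 3.0
    tail = sum(1 for s in sums if abs(s - mean) >= a) / trials
    print(tail, var / a ** 2)  # Chebyshev: the tail probability is bounded by Var(S)/a^2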

5.2 Laws of large numbers for random variables

An important application of these ideas is to show that the law of large numbers extends from events to random variables.

Theorem 5.49 (Weak law of large numbers): if a sequence of random variables X_i is independent, and if the random variables all have the same finite mean and variance, E[X_i] = µ and Var(X_i) = σ² < ∞, then

    S_n/n → µ in prob.,

where S_n = X_1 + ... + X_n is the partial sum of the sequence.

As you will see, the proof is really rather easy when we use Chebyshev's inequality above. Indeed it is also quite easy to generalize to the case when the random variables are correlated, as long as the covariances are small... However the corresponding result for almost sure convergence, rather than convergence in probability, is rather harder to prove.

Theorem 5.50 (Strong law of large numbers): if a sequence of random variables X_i is independent and identically distributed, and if E[X_i] = µ, then

    S_n/n → µ a.s.,

where S_n = X_1 + ... + X_n is the partial sum of the sequence.

5.3 Convergence of integrals and expectations

We already know a way to relate integrals to limits (MON). What about a general sequence of non-negative measurable functions?

Theorem 5.51 (Fatou's lemma (FATOU)): If the functions f_n : R → R are actually non-negative then

    ∫ lim inf f_n dµ ≤ lim inf ∫ f_n dµ.

We can also go "the other way":

Theorem 5.52 ("Reverse Fatou"): If the functions f_n : R → R are bounded above by g µ-a.e. and g is integrable then

    lim sup ∫ f_n dµ ≤ ∫ lim sup f_n dµ.

5.4 Dominated convergence theorem

Although in general one can't interchange limits and integrals, this can be done if all the functions (equivalently, random variables) involved are bounded in absolute value by a single non-negative function (random variable) which has finite integral.

Corollary 5.53 (Dominated convergence theorem (DOM)): If the functions f_n : R → R are bounded in absolute value by g µ-a.e. (so |f_n| ≤ g a.e.), g is integrable, and also f_n → f, then

    lim ∫ f_n dµ = ∫ f dµ.

This is a very powerful result.

Examples:

- If the X_n form a bounded sequence of random variables and they converge almost surely to X, then E[X_n] → E[X].

- Suppose that U is a random variable uniformly distributed over [0, 1] and

    X_n = Σ_{k=0}^{2^n − 1} k 2^{−n} I_{[k 2^{−n} ≤ U < (k+1) 2^{−n}]}.

Then E[log(1 − X_n)] → −1.

- Suppose that the X_n are independent and X_1 = 1, while for n ≥ 2

    P[X_n = n + 1] = P[X_n = 1/(n + 1)] = 1/n³,   P[X_n = 1] = 1 − 2/n³,

and Z_n = ∏_{i=1}^n X_i. Then the Z_n form an almost surely convergent sequence with limit Z, and E[Z_n] → E[Z].

6. Product measures

6.1 Product measure spaces

The idea here is: given two measure spaces (Ω, F, µ) and (Ω′, F′, ν), we build a measure space on Ω × Ω′ by using rectangle sets A × B with measures µ(A) ν(B). As you might guess from the product form µ(A) ν(B), in the context of probability this is related to independence.
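As a preview of why rectangle sets are so convenient (and of Fubini's theorem in the next section), here is a small grid-approximation sketch: iterating the integration in either order gives the same answer, and for the indicator of a rectangle A × B that answer is just µ(A) ν(B). All numerical choices here are arbitrary.

    def iterate(f, first='x', n=300):
        """Grid approximation of the iterated integral of f over [0,1] x [0,1],
        integrating the `first` variable on the inside."""
        h = 1.0 / n
        pts = [(k + 0.5) * h for k in range(n)]
        if first == 'x':
            return sum(sum(f(x, y) * h for x in pts) * h for y in pts)
        return sum(sum(f(x, y) * h for y in pts) * h for x in pts)

    rect = lambda x, y: 1.0 if (x <= 0.3 and y >= 0.5) else 0.0  # indicator of A x B, A = [0, 0.3], B = [0.5, 1]
    print(iterate(rect, 'x'), iterate(rect, 'y'), 0.3 * 0.5)
    # both iterated integrals agree with mu(A) * nu(B) = 0.15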

Definition 6.54 (Product measure space): define the product measure µ × ν on the Π-system R of rectangle sets A × B as above, by (µ × ν)(A × B) = µ(A) ν(B). Let A(R) be the algebra generated by R.

Lemma 6.55 (Representation of A(R)): every member of A(R) can be expressed as a finite disjoint union of rectangle sets.

It is now possible to apply the Extension Theorem (we need to check σ-additivity; this is non-trivial but works) to define the product measure µ × ν on the whole σ-algebra σ(R).

6.2 Fubini's theorem

There are three big results on integration. We have already met two: MON and DOM, which tell us cases when we can exchange integrals and limits. The other result arises in the situation where we have a product measure space. In such a case we can integrate any function in one of three possible ways: either using the product measure, or by first doing a partial integration holding one coordinate fixed, and then integrating with respect to that one. We call this alternative iterated integration, and obviously there are two ways to do it depending on which variable we fix first. The final big result is due to Fubini, and tells us that as long as the function is modestly well-behaved it doesn't matter which of the three ways we do the integration, we still get the same answer.

Theorem 6.56 (Fubini's theorem): Suppose f is a real-valued function defined on the product measure space above which is either (a) non-negative or (b) µ × ν-integrable. Then

    ∫ f d(µ × ν) = ∫_{Ω′} ( ∫_Ω f(ω, ω′) µ(dω) ) ν(dω′).

Notice the two alternative conditions. Non-negativity (sometimes described as Tonelli's condition) is easy to check but can be limited. Think carefully about Fubini's theorem and especially Tonelli's condition, and you will see that the only thing which can go wrong is when in the product form you have an "∞ − ∞" problem!

6.3 Relationship with independence

Suppose X and Y are independent random variables. Then the distribution of the pair (X, Y), a measure on R × R given by P_{(X,Y)}(A) = P[(X, Y) ∈ A], is exactly the product measure µ × ν, where µ is the distribution of X and ν is the distribution of Y.

End of outline notes


More information

CHAPTER 8: EXPLORING R

CHAPTER 8: EXPLORING R CHAPTER 8: EXPLORING R LECTURE NOTES FOR MATH 378 (CSUSM, SPRING 2009). WAYNE AITKEN In the previous chapter we discussed the need for a complete ordered field. The field Q is not complete, so we constructed

More information

2.23 Theorem. Let A and B be sets in a metric space. If A B, then L(A) L(B).

2.23 Theorem. Let A and B be sets in a metric space. If A B, then L(A) L(B). 2.23 Theorem. Let A and B be sets in a metric space. If A B, then L(A) L(B). 2.24 Theorem. Let A and B be sets in a metric space. Then L(A B) = L(A) L(B). It is worth noting that you can t replace union

More information

Tools from Lebesgue integration

Tools from Lebesgue integration Tools from Lebesgue integration E.P. van den Ban Fall 2005 Introduction In these notes we describe some of the basic tools from the theory of Lebesgue integration. Definitions and results will be given

More information

HILBERT SPACES AND THE RADON-NIKODYM THEOREM. where the bar in the first equation denotes complex conjugation. In either case, for any x V define

HILBERT SPACES AND THE RADON-NIKODYM THEOREM. where the bar in the first equation denotes complex conjugation. In either case, for any x V define HILBERT SPACES AND THE RADON-NIKODYM THEOREM STEVEN P. LALLEY 1. DEFINITIONS Definition 1. A real inner product space is a real vector space V together with a symmetric, bilinear, positive-definite mapping,

More information

Sample Spaces, Random Variables

Sample Spaces, Random Variables Sample Spaces, Random Variables Moulinath Banerjee University of Michigan August 3, 22 Probabilities In talking about probabilities, the fundamental object is Ω, the sample space. (elements) in Ω are denoted

More information

In N we can do addition, but in order to do subtraction we need to extend N to the integers

In N we can do addition, but in order to do subtraction we need to extend N to the integers Chapter The Real Numbers.. Some Preliminaries Discussion: The Irrationality of 2. We begin with the natural numbers N = {, 2, 3, }. In N we can do addition, but in order to do subtraction we need to extend

More information

University of Regina. Lecture Notes. Michael Kozdron

University of Regina. Lecture Notes. Michael Kozdron University of Regina Statistics 851 Probability Lecture Notes Winter 2008 Michael Kozdron kozdron@stat.math.uregina.ca http://stat.math.uregina.ca/ kozdron References [1] Jean Jacod and Philip Protter.

More information

Lecture 4: Constructing the Integers, Rationals and Reals

Lecture 4: Constructing the Integers, Rationals and Reals Math/CS 20: Intro. to Math Professor: Padraic Bartlett Lecture 4: Constructing the Integers, Rationals and Reals Week 5 UCSB 204 The Integers Normally, using the natural numbers, you can easily define

More information

02. Measure and integral. 1. Borel-measurable functions and pointwise limits

02. Measure and integral. 1. Borel-measurable functions and pointwise limits (October 3, 2017) 02. Measure and integral Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/real/notes 2017-18/02 measure and integral.pdf]

More information

Lectures on Elementary Probability. William G. Faris

Lectures on Elementary Probability. William G. Faris Lectures on Elementary Probability William G. Faris February 22, 2002 2 Contents 1 Combinatorics 5 1.1 Factorials and binomial coefficients................. 5 1.2 Sampling with replacement.....................

More information

Probability (Devore Chapter Two)

Probability (Devore Chapter Two) Probability (Devore Chapter Two) 1016-345-01: Probability and Statistics for Engineers Fall 2012 Contents 0 Administrata 2 0.1 Outline....................................... 3 1 Axiomatic Probability 3

More information

Lebesgue Integration: A non-rigorous introduction. What is wrong with Riemann integration?

Lebesgue Integration: A non-rigorous introduction. What is wrong with Riemann integration? Lebesgue Integration: A non-rigorous introduction What is wrong with Riemann integration? xample. Let f(x) = { 0 for x Q 1 for x / Q. The upper integral is 1, while the lower integral is 0. Yet, the function

More information

JUSTIN HARTMANN. F n Σ.

JUSTIN HARTMANN. F n Σ. BROWNIAN MOTION JUSTIN HARTMANN Abstract. This paper begins to explore a rigorous introduction to probability theory using ideas from algebra, measure theory, and other areas. We start with a basic explanation

More information

CONVERGENCE OF RANDOM SERIES AND MARTINGALES

CONVERGENCE OF RANDOM SERIES AND MARTINGALES CONVERGENCE OF RANDOM SERIES AND MARTINGALES WESLEY LEE Abstract. This paper is an introduction to probability from a measuretheoretic standpoint. After covering probability spaces, it delves into the

More information

Module 1. Probability

Module 1. Probability Module 1 Probability 1. Introduction In our daily life we come across many processes whose nature cannot be predicted in advance. Such processes are referred to as random processes. The only way to derive

More information

Notes on the Lebesgue Integral by Francis J. Narcowich Septemmber, 2014

Notes on the Lebesgue Integral by Francis J. Narcowich Septemmber, 2014 1 Introduction Notes on the Lebesgue Integral by Francis J. Narcowich Septemmber, 2014 In the definition of the Riemann integral of a function f(x), the x-axis is partitioned and the integral is defined

More information

REAL AND COMPLEX ANALYSIS

REAL AND COMPLEX ANALYSIS REAL AND COMPLE ANALYSIS Third Edition Walter Rudin Professor of Mathematics University of Wisconsin, Madison Version 1.1 No rights reserved. Any part of this work can be reproduced or transmitted in any

More information

MATH MEASURE THEORY AND FOURIER ANALYSIS. Contents

MATH MEASURE THEORY AND FOURIER ANALYSIS. Contents MATH 3969 - MEASURE THEORY AND FOURIER ANALYSIS ANDREW TULLOCH Contents 1. Measure Theory 2 1.1. Properties of Measures 3 1.2. Constructing σ-algebras and measures 3 1.3. Properties of the Lebesgue measure

More information

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1 Chapter 2 Probability measures 1. Existence Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension to the generated σ-field Proof of Theorem 2.1. Let F 0 be

More information

3 Integration and Expectation

3 Integration and Expectation 3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ

More information

ABSTRACT INTEGRATION CHAPTER ONE

ABSTRACT INTEGRATION CHAPTER ONE CHAPTER ONE ABSTRACT INTEGRATION Version 1.1 No rights reserved. Any part of this work can be reproduced or transmitted in any form or by any means. Suggestions and errors are invited and can be mailed

More information

Basic Probability. Introduction

Basic Probability. Introduction Basic Probability Introduction The world is an uncertain place. Making predictions about something as seemingly mundane as tomorrow s weather, for example, is actually quite a difficult task. Even with

More information

18.175: Lecture 3 Integration

18.175: Lecture 3 Integration 18.175: Lecture 3 Scott Sheffield MIT Outline Outline Recall definitions Probability space is triple (Ω, F, P) where Ω is sample space, F is set of events (the σ-algebra) and P : F [0, 1] is the probability

More information

Chapter 1: Probability Theory Lecture 1: Measure space, measurable function, and integration

Chapter 1: Probability Theory Lecture 1: Measure space, measurable function, and integration Chapter 1: Probability Theory Lecture 1: Measure space, measurable function, and integration Random experiment: uncertainty in outcomes Ω: sample space: a set containing all possible outcomes Definition

More information

Dynkin (λ-) and π-systems; monotone classes of sets, and of functions with some examples of application (mainly of a probabilistic flavor)

Dynkin (λ-) and π-systems; monotone classes of sets, and of functions with some examples of application (mainly of a probabilistic flavor) Dynkin (λ-) and π-systems; monotone classes of sets, and of functions with some examples of application (mainly of a probabilistic flavor) Matija Vidmar February 7, 2018 1 Dynkin and π-systems Some basic

More information

1.4 Techniques of Integration

1.4 Techniques of Integration .4 Techniques of Integration Recall the following strategy for evaluating definite integrals, which arose from the Fundamental Theorem of Calculus (see Section.3). To calculate b a f(x) dx. Find a function

More information

Metric spaces and metrizability

Metric spaces and metrizability 1 Motivation Metric spaces and metrizability By this point in the course, this section should not need much in the way of motivation. From the very beginning, we have talked about R n usual and how relatively

More information

Notes 1 Autumn Sample space, events. S is the number of elements in the set S.)

Notes 1 Autumn Sample space, events. S is the number of elements in the set S.) MAS 108 Probability I Notes 1 Autumn 2005 Sample space, events The general setting is: We perform an experiment which can have a number of different outcomes. The sample space is the set of all possible

More information

Probability Theory. Richard F. Bass

Probability Theory. Richard F. Bass Probability Theory Richard F. Bass ii c Copyright 2014 Richard F. Bass Contents 1 Basic notions 1 1.1 A few definitions from measure theory............. 1 1.2 Definitions............................. 2

More information

µ (X) := inf l(i k ) where X k=1 I k, I k an open interval Notice that is a map from subsets of R to non-negative number together with infinity

µ (X) := inf l(i k ) where X k=1 I k, I k an open interval Notice that is a map from subsets of R to non-negative number together with infinity A crash course in Lebesgue measure theory, Math 317, Intro to Analysis II These lecture notes are inspired by the third edition of Royden s Real analysis. The Jordan content is an attempt to extend the

More information

University of Sheffield. School of Mathematics & and Statistics. Measure and Probability MAS350/451/6352

University of Sheffield. School of Mathematics & and Statistics. Measure and Probability MAS350/451/6352 University of Sheffield School of Mathematics & and Statistics Measure and Probability MAS350/451/6352 Spring 2018 Chapter 1 Measure Spaces and Measure 1.1 What is Measure? Measure theory is the abstract

More information

Introduction to Proofs in Analysis. updated December 5, By Edoh Y. Amiran Following the outline of notes by Donald Chalice INTRODUCTION

Introduction to Proofs in Analysis. updated December 5, By Edoh Y. Amiran Following the outline of notes by Donald Chalice INTRODUCTION Introduction to Proofs in Analysis updated December 5, 2016 By Edoh Y. Amiran Following the outline of notes by Donald Chalice INTRODUCTION Purpose. These notes intend to introduce four main notions from

More information

The strictly 1/2-stable example

The strictly 1/2-stable example The strictly 1/2-stable example 1 Direct approach: building a Lévy pure jump process on R Bert Fristedt provided key mathematical facts for this example. A pure jump Lévy process X is a Lévy process such

More information

Building Infinite Processes from Finite-Dimensional Distributions

Building Infinite Processes from Finite-Dimensional Distributions Chapter 2 Building Infinite Processes from Finite-Dimensional Distributions Section 2.1 introduces the finite-dimensional distributions of a stochastic process, and shows how they determine its infinite-dimensional

More information

DR.RUPNATHJI( DR.RUPAK NATH )

DR.RUPNATHJI( DR.RUPAK NATH ) Contents 1 Sets 1 2 The Real Numbers 9 3 Sequences 29 4 Series 59 5 Functions 81 6 Power Series 105 7 The elementary functions 111 Chapter 1 Sets It is very convenient to introduce some notation and terminology

More information

MA554 Assessment 1 Cosets and Lagrange s theorem

MA554 Assessment 1 Cosets and Lagrange s theorem MA554 Assessment 1 Cosets and Lagrange s theorem These are notes on cosets and Lagrange s theorem; they go over some material from the lectures again, and they have some new material it is all examinable,

More information

MATH & MATH FUNCTIONS OF A REAL VARIABLE EXERCISES FALL 2015 & SPRING Scientia Imperii Decus et Tutamen 1

MATH & MATH FUNCTIONS OF A REAL VARIABLE EXERCISES FALL 2015 & SPRING Scientia Imperii Decus et Tutamen 1 MATH 5310.001 & MATH 5320.001 FUNCTIONS OF A REAL VARIABLE EXERCISES FALL 2015 & SPRING 2016 Scientia Imperii Decus et Tutamen 1 Robert R. Kallman University of North Texas Department of Mathematics 1155

More information

Reminder Notes for the Course on Measures on Topological Spaces

Reminder Notes for the Course on Measures on Topological Spaces Reminder Notes for the Course on Measures on Topological Spaces T. C. Dorlas Dublin Institute for Advanced Studies School of Theoretical Physics 10 Burlington Road, Dublin 4, Ireland. Email: dorlas@stp.dias.ie

More information

2. Two binary operations (addition, denoted + and multiplication, denoted

2. Two binary operations (addition, denoted + and multiplication, denoted Chapter 2 The Structure of R The purpose of this chapter is to explain to the reader why the set of real numbers is so special. By the end of this chapter, the reader should understand the difference between

More information

Review of Probability Theory

Review of Probability Theory Review of Probability Theory Arian Maleki and Tom Do Stanford University Probability theory is the study of uncertainty Through this class, we will be relying on concepts from probability theory for deriving

More information

REAL ANALYSIS LECTURE NOTES: 1.4 OUTER MEASURE

REAL ANALYSIS LECTURE NOTES: 1.4 OUTER MEASURE REAL ANALYSIS LECTURE NOTES: 1.4 OUTER MEASURE CHRISTOPHER HEIL 1.4.1 Introduction We will expand on Section 1.4 of Folland s text, which covers abstract outer measures also called exterior measures).

More information

Lebesgue Measure. Dung Le 1

Lebesgue Measure. Dung Le 1 Lebesgue Measure Dung Le 1 1 Introduction How do we measure the size of a set in IR? Let s start with the simplest ones: intervals. Obviously, the natural candidate for a measure of an interval is its

More information

STA 711: Probability & Measure Theory Robert L. Wolpert

STA 711: Probability & Measure Theory Robert L. Wolpert STA 711: Probability & Measure Theory Robert L. Wolpert 6 Independence 6.1 Independent Events A collection of events {A i } F in a probability space (Ω,F,P) is called independent if P[ i I A i ] = P[A

More information

The Lebesgue Integral

The Lebesgue Integral The Lebesgue Integral Brent Nelson In these notes we give an introduction to the Lebesgue integral, assuming only a knowledge of metric spaces and the iemann integral. For more details see [1, Chapters

More information

Measure Theory on Topological Spaces. Course: Prof. Tony Dorlas 2010 Typset: Cathal Ormond

Measure Theory on Topological Spaces. Course: Prof. Tony Dorlas 2010 Typset: Cathal Ormond Measure Theory on Topological Spaces Course: Prof. Tony Dorlas 2010 Typset: Cathal Ormond May 22, 2011 Contents 1 Introduction 2 1.1 The Riemann Integral........................................ 2 1.2 Measurable..............................................

More information

Mathematical Methods for Physics and Engineering

Mathematical Methods for Physics and Engineering Mathematical Methods for Physics and Engineering Lecture notes for PDEs Sergei V. Shabanov Department of Mathematics, University of Florida, Gainesville, FL 32611 USA CHAPTER 1 The integration theory

More information

CONSTRUCTION OF THE REAL NUMBERS.

CONSTRUCTION OF THE REAL NUMBERS. CONSTRUCTION OF THE REAL NUMBERS. IAN KIMING 1. Motivation. It will not come as a big surprise to anyone when I say that we need the real numbers in mathematics. More to the point, we need to be able to

More information

CS 246 Review of Proof Techniques and Probability 01/14/19

CS 246 Review of Proof Techniques and Probability 01/14/19 Note: This document has been adapted from a similar review session for CS224W (Autumn 2018). It was originally compiled by Jessica Su, with minor edits by Jayadev Bhaskaran. 1 Proof techniques Here we

More information

II - REAL ANALYSIS. This property gives us a way to extend the notion of content to finite unions of rectangles: we define

II - REAL ANALYSIS. This property gives us a way to extend the notion of content to finite unions of rectangles: we define 1 Measures 1.1 Jordan content in R N II - REAL ANALYSIS Let I be an interval in R. Then its 1-content is defined as c 1 (I) := b a if I is bounded with endpoints a, b. If I is unbounded, we define c 1

More information