2 Conditioning


1 Conditional Distributions

Let A and B be events, and suppose that P(B) > 0. We recall from Section 3 of the Introduction that the conditional probability of A given B is defined as P(A | B) = P(A ∩ B)/P(B) and that P(A | B) = P(A) if A and B are independent. Now, let (X, Y) be a two-dimensional random variable whose components are discrete.

Example 1.1. A symmetric die is thrown twice. Let U₁ be a random variable denoting the number of dots on the first throw, let U₂ be a random variable denoting the number of dots on the second throw, and set X = U₁ + U₂ and Y = min{U₁, U₂}. Suppose we wish to find the distribution of Y for some given value of X, for example, P(Y = 2 | X = 7). Set A = {Y = 2} and B = {X = 7}. From the definition of conditional probabilities we obtain

P(Y = 2 | X = 7) = P(A | B) = P(A ∩ B)/P(B) = (2/36)/(6/36) = 1/3.

With this method one may compute P(Y = y | X = x) for any fixed value of x as y varies for arbitrary, discrete, jointly distributed random variables. This leads to the following definition.

Definition 1.1. Let X and Y be discrete, jointly distributed random variables. For P(X = x) > 0 the conditional probability function of Y given that X = x is

p_{Y|X=x}(y) = P(Y = y | X = x) = p_{X,Y}(x, y)/p_X(x),

and the conditional distribution function of Y given that X = x is

F_{Y|X=x}(y) = Σ_{z ≤ y} p_{Y|X=x}(z).
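Since the sample space in Example 1.1 is finite, Definition 1.1 can be checked by brute-force enumeration of the 36 equally likely outcomes. The following is a minimal Python sketch (not from the text; the variable names are mine), using exact fractions:

```python
from fractions import Fraction
from collections import defaultdict

# Enumerate the 36 equally likely outcomes of two throws of a symmetric die.
joint = defaultdict(Fraction)            # joint[(x, y)] = P(X = x, Y = y)
for u1 in range(1, 7):
    for u2 in range(1, 7):
        x, y = u1 + u2, min(u1, u2)
        joint[(x, y)] += Fraction(1, 36)

# Conditional probability function p_{Y|X=7}(y) = p_{X,Y}(7, y) / p_X(7).
px7 = sum(p for (x, _), p in joint.items() if x == 7)
for y in range(1, 7):
    pxy = joint[(7, y)]
    if pxy:
        print(y, pxy / px7)              # prints 1/3 for y = 1, 2, 3
```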

Exercise 1.1. Show that p_{Y|X=x}(y) is a probability function of a true probability distribution.

It follows immediately (please check) that

p_{Y|X=x}(y) = p_{X,Y}(x, y)/p_X(x) = p_{X,Y}(x, y) / Σ_z p_{X,Y}(x, z)

and that

F_{Y|X=x}(y) = Σ_{z ≤ y} p_{X,Y}(x, z) / p_X(x) = Σ_{z ≤ y} p_{X,Y}(x, z) / Σ_z p_{X,Y}(x, z).

Exercise 1.2. Compute the conditional probability function p_{Y|X=x}(y) and the conditional distribution function F_{Y|X=x}(y) in Example 1.1.

Now let X and Y have a joint continuous distribution. Expressions like P(Y = y | X = x) have no meaning in this case, since the probability that a fixed value is assumed equals zero. However, an examination of how the preceding conditional probabilities are computed makes the following definition very natural.

Definition 1.2. Let X and Y have a joint continuous distribution. For f_X(x) > 0, the conditional density function of Y given that X = x is

f_{Y|X=x}(y) = f_{X,Y}(x, y)/f_X(x),

and the conditional distribution function of Y given that X = x is

F_{Y|X=x}(y) = ∫_{−∞}^{y} f_{Y|X=x}(z) dz.

In analogy with the discrete case, we further have

f_{Y|X=x}(y) = f_{X,Y}(x, y) / ∫_{−∞}^{∞} f_{X,Y}(x, z) dz

and

F_{Y|X=x}(y) = ∫_{−∞}^{y} f_{X,Y}(x, z) dz / ∫_{−∞}^{∞} f_{X,Y}(x, z) dz.

Exercise 1.3. Show that f_{Y|X=x}(y) is a density function of a true probability distribution.

Exercise 1.4. Find the conditional distribution of Y given that X = x in Example and Exercise .

Exercise 1.5. Prove that if X and Y are independent then the conditional distributions and the unconditional distributions are the same. Explain why this is reasonable.

Remark 1.1. Definitions 1.1 and 1.2 can be extended to situations with more than two random variables. How?

2 Conditional Expectation and Conditional Variance

In the same vein as the concepts of expected value and variance are introduced as convenient location and dispersion measures for (ordinary) random variables or distributions, it is natural to introduce analogs to these concepts for conditional distributions. The following example shows how such notions enter naturally.

Example 2.1. A stick of length one is broken at a random point, uniformly distributed over the stick. The remaining piece is broken once more. Find the expected value and variance of the piece that now remains.

In order to solve this problem we let X ∼ U(0, 1) be the first remaining piece. The second remaining piece Y is uniformly distributed on the interval (0, X). This is to be interpreted as follows: Given that X = x, the random variable Y is uniformly distributed on the interval (0, x): Y | X = x ∼ U(0, x), that is, f_{Y|X=x}(y) = 1/x for 0 < y < x and = 0 otherwise. Clearly, E X = 1/2 and Var X = 1/12. Furthermore, intuition suggests that

E(Y | X = x) = x/2  and  Var(Y | X = x) = x²/12.   (2.1)

We wish to determine E Y and Var Y somehow with the aid of the preceding relations. We are now ready to state our first definition.

Definition 2.1. Let X and Y be jointly distributed random variables. The conditional expectation of Y given that X = x is

E(Y | X = x) = Σ_y y · p_{Y|X=x}(y) in the discrete case,
E(Y | X = x) = ∫_{−∞}^{∞} y · f_{Y|X=x}(y) dy in the continuous case,

provided the relevant sum or integral is absolutely convergent.

Exercise 2.1. Let X, Y, Y₁, and Y₂ be random variables, let g be a function, and c a constant. Show that
(a) E(c | X = x) = c,
(b) E(Y₁ + Y₂ | X = x) = E(Y₁ | X = x) + E(Y₂ | X = x),
(c) E(cY | X = x) = c E(Y | X = x),
(d) E(g(X, Y) | X = x) = E(g(x, Y) | X = x),
(e) E(Y | X = x) = E Y if X and Y are independent.

The conditional distribution of Y given that X = x depends on the value of x (unless X and Y are independent). This implies that the conditional expectation E(Y | X = x) is a function of x, that is,

E(Y | X = x) = h(x)   (2.2)

for some function h. (If X and Y are independent, then check that h(x) = E Y, a constant.) An object of considerable interest and importance is the random variable h(X), which we denote by

h(X) = E(Y | X).   (2.3)

This random variable is of interest not only in the context of probability theory (as we shall see later) but also in statistics in connection with estimation. Loosely speaking, it turns out that if Y is a good estimator and X is suitably chosen, then E(Y | X) is a better estimator. Technically, given a so-called unbiased estimator U of a parameter θ, it is possible to construct another unbiased estimator V by considering the conditional expectation of U with respect to what is called a sufficient statistic T; that is, V = E(U | T). The point is that E U = E V = θ (unbiasedness) and that Var V ≤ Var U (this follows essentially from the sufficiency and Theorem 2.3 ahead). For details, we refer to the statistics literature provided in Appendix A.

A natural question at this point is: What is the expected value of the random variable E(Y | X)?

Theorem 2.1. Suppose that E|Y| < ∞. Then

E(E(Y | X)) = E Y.

Proof. We prove the theorem for the continuous case and leave the (completely analogous) proof for the discrete case as an exercise.

E(E(Y | X)) = E h(X) = ∫_{−∞}^{∞} h(x) f_X(x) dx = ∫_{−∞}^{∞} E(Y | X = x) f_X(x) dx = ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} y f_{Y|X=x}(y) dy ) f_X(x) dx

= ∫_{−∞}^{∞} ∫_{−∞}^{∞} y (f_{X,Y}(x, y)/f_X(x)) f_X(x) dy dx = ∫_{−∞}^{∞} y ( ∫_{−∞}^{∞} f_{X,Y}(x, y) dx ) dy = ∫_{−∞}^{∞} y f_Y(y) dy = E Y.

Remark 2.1. Theorem 2.1 can be interpreted as an expectation version of the law of total probability.

Remark 2.2. Clearly, E Y must exist in order for Theorem 2.1 to make sense, that is, the corresponding sum or integral must be absolutely convergent. Now, given this assumption, one can show that E(E(Y | X)) exists and is finite and that the computations in the proof, such as reversing orders of integration, are permitted. We shall, in the sequel, permit ourselves at times to be somewhat sloppy about such verifications. Analogous remarks apply to further results ahead. We close this remark by pointing out that the conclusion always holds in case Y is nonnegative, in the sense that if one of the members is infinite, then so is the other.

Exercise 2.2. The object of this exercise is to show that if we do not assume that E|Y| < ∞ in Theorem 2.1, then the conclusion does not necessarily hold. Namely, suppose that X ∼ Γ(1/2, 2) (= χ²(1)) and that

f_{Y|X=x}(y) = (1/√(2π)) x^{1/2} e^{−xy²/2},  −∞ < y < ∞.

(a) Compute E(Y | X = x), E(Y | X), and, finally, E(E(Y | X)).
(b) Show that Y ∼ C(0, 1).
(c) What about E Y?

We are now able to find E Y in Example 2.1.

Example 2.1 (continued). It follows from the definition that the first part of (2.1) holds: E(Y | X = x) = x/2, that is, h(x) = x/2. An application of Theorem 2.1 now yields

E Y = E(E(Y | X)) = E h(X) = E(X/2) = (1/2) E X = 1/4.

We have thus determined E Y without prior knowledge about the distribution of Y.

Exercise 2.3. Find the expectation of the remaining piece after it has been broken off n times.
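The value E Y = 1/4 for Example 2.1 is easy to check by a quick Monte Carlo experiment. A minimal sketch (sample size, seed, and variable names are my choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
x = rng.uniform(0.0, 1.0, n)          # first remaining piece, X ~ U(0, 1)
y = rng.uniform(0.0, x)               # second piece, Y | X = x ~ U(0, x)

print(y.mean())                       # ~ 0.25 = E(E(Y|X)) = E(X/2)
```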

Remark 2.3. That the result E Y = 1/4 is reasonable can intuitively be seen from the fact that X on average equals 1/2 and that Y on average equals half the value of X, that is, 1/2 of 1/2. The proof of Theorem 2.1 consists, in fact, of a stringent version of this kind of argument.

Theorem 2.2. Let X and Y be random variables and g be a function. We have
(a) E(g(X)Y | X) = g(X) E(Y | X), and
(b) E(Y | X) = E Y if X and Y are independent.

Exercise 2.4. Prove Theorem 2.2.

Remark 2.4. Conditioning with respect to X means that X should be interpreted as known, and, hence, g(X) as a constant that thus may be moved in front of the expectation (recall Exercise 2.1(a)). This explains why Theorem 2.2(a) should hold. Part (b) follows from the fact that the conditional distribution and the unconditional distribution coincide if X and Y are independent; in particular, this should remain true for the conditional expectation and the unconditional expectation (recall Exercises 1.5 and 2.1(e)).

A natural problem is to find the variance of the remaining piece Y in Example 2.1, which, in turn, suggests the introduction of the concept of conditional variance.

Definition 2.2. Let X and Y have a joint distribution. The conditional variance of Y given that X = x is

Var(Y | X = x) = E((Y − E(Y | X = x))² | X = x),

provided the corresponding sum or integral is absolutely convergent.

The conditional variance is (also) a function of x; call it v(x). The corresponding random variable is

v(X) = Var(Y | X).   (2.4)

The following result is fundamental.

Theorem 2.3. Let X and Y be random variables and g a real-valued function. If E Y² < ∞ and E(g(X))² < ∞, then

E(Y − g(X))² = E Var(Y | X) + E(E(Y | X) − g(X))².

Proof. An expansion of the left-hand side yields

E(Y − g(X))² = E(Y − E(Y | X) + E(Y | X) − g(X))²
= E(Y − E(Y | X))² + 2 E(Y − E(Y | X))(E(Y | X) − g(X)) + E(E(Y | X) − g(X))².

Using Theorem 2.1, the right-hand side becomes

E E((Y − E(Y | X))² | X) + 2 E E((Y − E(Y | X))(E(Y | X) − g(X)) | X) + E(E(Y | X) − g(X))²
= E Var(Y | X) + 2 E{(E(Y | X) − g(X)) E(Y − E(Y | X) | X)} + E(E(Y | X) − g(X))²

by Theorem 2.2(a). Finally, since E(Y − E(Y | X) | X) = 0, this equals

E Var(Y | X) + 0 + E(E(Y | X) − g(X))²,

which was to be proved.

The particular choice g(X) = E Y, together with an application of Theorem 2.1, yields the following corollary:

Corollary 2.3.1. Suppose that E Y² < ∞. Then

Var Y = E Var(Y | X) + Var(E(Y | X)).

Example 2.1 (continued). Let us determine Var Y with the aid of Corollary 2.3.1. It follows from the second part of formula (2.1) that Var(Y | X = x) = x²/12, and hence v(X) = X²/12, so that

E Var(Y | X) = E v(X) = E(X²/12) = (1/12)·(1/3) = 1/36.

Furthermore,

Var(E(Y | X)) = Var(h(X)) = Var(X/2) = (1/4) Var X = (1/4)·(1/12) = 1/48.

An application of Corollary 2.3.1 finally yields Var Y = 1/36 + 1/48 = 7/144. We have thus computed Var Y without knowing the distribution of Y.

Exercise 2.5. Find the distribution of Y in Example 2.1, and verify the values of E Y and Var Y obtained above.

A discrete variant of Example 2.1 is the following: Let X be uniformly distributed over the numbers 1, 2, ..., 6 (that is, throw a symmetric die) and let Y be uniformly distributed over the numbers 1, 2, ..., X (that is, then throw a symmetric die with X faces). In this case,

h(x) = E(Y | X = x) = (1 + x)/2,

from which it follows that

E Y = E h(X) = E((1 + X)/2) = (1/2)(1 + E X) = (1/2)(1 + 3.5) = 2.25.

The computation of Var Y is somewhat more elaborate. We leave the details to the reader.
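Both the continuous and the discrete variants can be checked by simulation. A small sketch in the same spirit as before (sample size and seed are my choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10**6

# Broken stick: Var Y should be close to 1/36 + 1/48 = 7/144 ~ 0.0486.
x = rng.uniform(0.0, 1.0, n)
y = rng.uniform(0.0, x)
print(y.var(), 7 / 144)

# Discrete variant: X uniform on 1..6, then Y uniform on 1..X; E Y = 2.25.
xd = rng.integers(1, 7, n)
yd = rng.integers(1, xd + 1)          # uniform on {1, ..., X}
print(yd.mean())
```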

3 Distributions with Random Parameters

We begin with two examples:

Example 3.1. Suppose that the density X of red blood corpuscles in humans follows a Poisson distribution whose parameter depends on the observed individual. This means that for Jürg we have X ∼ Po(m_J), where m_J is Jürg's parameter value, while for Alice we have X ∼ Po(m_A), where m_A is Alice's parameter value. For a person selected at random we may consider the parameter value M as a random variable such that, given that M = m, we have X ∼ Po(m); namely,

P(X = k | M = m) = e^{−m} m^k / k!,  k = 0, 1, 2, ....   (3.1)

Thus, if we know that Alice was chosen, then P(X = k | M = m_A) = e^{−m_A} m_A^k / k!, for k = 0, 1, 2, ..., as before. We shall soon see that X itself (unconditioned) need not follow a Poisson distribution.

Example 3.2. A radioactive substance emits α-particles in such a way that the number of emitted particles during an hour, N, follows a Po(λ)-distribution. The particle counter, however, is somewhat unreliable in the sense that an emitted particle is registered with probability p (0 < p < 1), whereas it remains unregistered with probability q = 1 − p. All particles are registered independently of each other. This means that if we know that n particles were emitted during a specific hour, then the number of registered particles X ∼ Bin(n, p), that is,

P(X = k | N = n) = (n choose k) p^k q^{n−k},  k = 0, 1, ..., n   (3.2)

(and N ∼ Po(λ)). If, however, we observe the process during an arbitrarily chosen hour, it follows, as will be seen below, that the number of registered particles does not follow a binomial distribution (but instead a Poisson distribution).

The common feature in these examples is that the random variable under consideration, X, has a known distribution but with a parameter that is a random variable. Somewhat imprecisely, we might say that in Example 3.1 we have X ∼ Po(M), where M follows some distribution, and that in Example 3.2 we have X ∼ Bin(N, p), where N ∼ Po(λ). We prefer, however, to describe these cases as

X | M = m ∼ Po(m) with M ∼ F,   (3.3)

where F is some distribution, and

X | N = n ∼ Bin(n, p) with N ∼ Po(λ),   (3.4)

respectively.

Let us now determine the (unconditional) distributions of X in our examples, where, in Example 3.1, we assume that M ∼ Exp(1).

Example 3.1 (continued). We thus have

X | M = m ∼ Po(m) with M ∼ Exp(1).   (3.5)

By (the continuous version of) the law of total probability, we obtain, for k = 0, 1, 2, ...,

P(X = k) = ∫_0^∞ P(X = k | M = x) f_M(x) dx = ∫_0^∞ (e^{−x} x^k / k!) e^{−x} dx = (1/k!) ∫_0^∞ x^k e^{−2x} dx = (1/k!) · Γ(k + 1)/2^{k+1} = (1/2)^{k+1},

that is, X ∼ Ge(1/2). The unconditional distribution in this case thus is not a Poisson distribution; it is a geometric distribution.

Exercise 3.1. Determine the distribution of X if M has
(a) an Exp(a)-distribution,
(b) a Γ(p, a)-distribution.

Note also that we may use the formulas from Section 2 to compute E X and Var X without knowing the distribution of X. Namely, since E(X | M = m) = m (i.e., h(M) = E(X | M) = M), Theorem 2.1 yields

E X = E(E(X | M)) = E M = 1,

and Corollary 2.3.1 yields

Var X = E Var(X | M) + Var(E(X | M)) = E M + Var M = 1 + 1 = 2.

If, however, the distribution has been determined (as above), the formulas from Section 2 may be used for checking. If applied to Exercise 3.1(a), the latter formulas yield E X = a and Var X = a + a². Since this situation differs from Example 3.1 only by a rescaling of M, one might perhaps guess that the solution is another geometric distribution. If this were true, we would have

E X = q/p = (1 − p)/p = a,  so that  p = 1/(a + 1).

This value of p inserted in the expression for the variance yields

Var X = q/p² = (1 − p)/p² = a(a + 1) = a + a²,

which coincides with our computations above and supports the guess that X ∼ Ge(1/(a + 1)).
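The conclusion X ∼ Ge(1/2) in Example 3.1, as well as E X = 1 and Var X = 2, can be illustrated by simulating the two-stage model directly. A minimal sketch (sample size and seed are my choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10**6
m = rng.exponential(1.0, n)           # M ~ Exp(1)
x = rng.poisson(m)                    # X | M = m ~ Po(m)

# Compare the empirical pmf with the Ge(1/2) pmf (1/2)^(k+1), k = 0, 1, 2, ...
for k in range(6):
    print(k, (x == k).mean(), 0.5 ** (k + 1))

print(x.mean(), x.var())              # ~ 1 and ~ 2, as computed in the text
```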

Remark 3.1. In Example 3.1 we used the results of Section 2 to confirm our result. In Exercise 3.1(a) they were used to confirm (provide) a guess.

We now turn to the α-particles.

Example 3.2 (continued). Intuitively, the deficiency of the particle counter implies that the radiation actually measured is, on average, a fraction p of the original Poisson stream of particles. We might therefore expect that the number of registered particles during one hour should be a Po(λp)-distributed random variable. That this is actually correct is verified next. The model implies that X | N = n ∼ Bin(n, p) with N ∼ Po(λ). The law of total probability yields, for k = 0, 1, 2, ...,

P(X = k) = Σ_{n=k}^{∞} P(X = k | N = n) P(N = n) = Σ_{n=k}^{∞} (n choose k) p^k q^{n−k} e^{−λ} λ^n / n!
= (p^k/k!) e^{−λ} Σ_{n=k}^{∞} λ^n q^{n−k}/(n − k)! = ((λp)^k/k!) e^{−λ} Σ_{j=0}^{∞} (λq)^j/j!
= ((λp)^k/k!) e^{−λ} e^{λq} = e^{−λp} (λp)^k/k!,

that is, X ∼ Po(λp). The unconditional distribution thus is not a binomial distribution; it is a Poisson distribution.

Remark 3.2. This is an example of a so-called thinned Poisson process. For more details, we refer to Section 8.6.

Exercise 3.2. Use Theorem 2.1 and Corollary 2.3.1 to check the values of E X and Var X.
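A simulation of the thinning mechanism in Example 3.2 illustrates the Po(λp) conclusion. The particular values λ = 4 and p = 0.3 below are arbitrary illustration choices of mine:

```python
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(3)
lam, p, n = 4.0, 0.3, 10**6
emitted = rng.poisson(lam, n)              # N ~ Po(lambda)
registered = rng.binomial(emitted, p)      # X | N = n ~ Bin(n, p)

# X should be Po(lambda * p); compare a few pmf values.
for k in range(5):
    print(k, (registered == k).mean(),
          exp(-lam * p) * (lam * p) ** k / factorial(k))
```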

A family of distributions that is of special interest is the family of mixed normal, or mixed Gaussian, distributions. These are normal distributions with a random variance, namely,

X | Σ² = y ∼ N(µ, y) with Σ² ∼ F,   (3.6)

where F is some distribution (on (0, ∞)). For simplicity we assume in the following that µ = 0.

As an example, consider normally distributed observations with rare disturbances. More specifically, the observations might be N(0, 1)-distributed with probability 0.99 and N(0, 100)-distributed with probability 0.01. We may write this as X ∼ N(0, Σ²), where P(Σ² = 1) = 0.99 and P(Σ² = 100) = 0.01. By Theorem 2.1 it follows immediately that E X = 0. As for the variance, Corollary 2.3.1 tells us that

Var X = E Var(X | Σ²) + Var(E(X | Σ²)) = E Σ² = 0.99 · 1 + 0.01 · 100 = 1.99.
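For this two-point mixture the computation is easy to mimic numerically. A small sketch (sample size and seed are mine):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10**6
# Sigma^2 equals 1 with probability 0.99 and 100 with probability 0.01.
sigma2 = rng.choice([1.0, 100.0], size=n, p=[0.99, 0.01])
x = rng.normal(0.0, np.sqrt(sigma2))       # X | Sigma^2 = y ~ N(0, y)

print(x.mean(), x.var())                   # ~ 0 and ~ E Sigma^2 = 1.99
```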

12 4 Conditioning differentiate (differentiation and integration may be interchanged), and make the change of variable y x/u. This yields I (x) ( x ) u exp x u u} du } exp y x y dy. It follows that I satisfies the differential equation with the initial condition the solution of which is I() I(x) I (x) I(x) e u du π, π e x, x >. (3.9) By inserting (3.9) into the expression for f X (x), and noting that the density is symmetric around x, we finally obtain π f X (x) π e x 1 e x 1 e x, < x <, that is, X L( 1 ); a Laplace distribution. An extra check yields E X and Var X E Σ 1 ( ( 1 ) ), as desired. Exercise 3.3. Show that if X has a normal distribution such that the mean is zero and the inverse of the variance is Γ-distributed, viz., ( n X Σ λ N(, 1/λ) with Σ Γ n),, then X t(n). Exercise 3.4. Sheila has a coin with P (head) p 1 and Betty has a coin with P (head) p. Sheila tosses her coin m times. Each time she obtains heads, Betty tosses her coin (otherwise not). Find the distribution of the total number of heads obtained by Betty. Further, check that mean and variance coincide with the values obtained by Theorem.1 and Corollary.3.1. Alternatively, find mean and variance first and try to guess the desired distribution (and check if your guess was correct). As a hint, observe that the game can be modeled as follows: Let N be the number of heads obtained by Sheila and X be the number of heads obtained by Betty. We thus wish to find the distribution of X, where X N n Bin(n, p ) with N Bin(m, p 1 ), < p 1, p < 1. We shall return to the topic of this section in Section 3.5.
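The mixture representation (3.8) and the L(1/√2) density derived above can be compared by simulation. A rough sketch (the sample size, seed, and the window width 0.1 used as a crude density estimate are my choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10**6
sigma2 = rng.exponential(1.0, n)              # Sigma^2 ~ Exp(1)
x = rng.normal(0.0, np.sqrt(sigma2))          # X | Sigma^2 = y ~ N(0, y)

# Compare with the L(1/sqrt(2)) density f(x) = (1/sqrt(2)) exp(-sqrt(2)|x|).
for t in (0.0, 0.5, 1.0, 2.0):
    empirical = ((x > t - 0.05) & (x < t + 0.05)).mean() / 0.1
    print(t, empirical, np.exp(-np.sqrt(2) * abs(t)) / np.sqrt(2))
```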

4 The Bayesian Approach

A typical problem in probability theory begins with assumptions such as let X ∼ Po(m), let Y ∼ N(µ, σ²), toss a symmetric coin 15 times, and so forth. In the computations that follow, one tacitly assumes that all parameters are known, that the coin is exactly symmetric, and so on. In statistics one assumes (certain) parameters to be unknown, for example, that the coin might be asymmetric, and one searches for methods, devices, and rules to decide whether or not one should believe in certain hypotheses. Two typical illustrations in the Gaussian approach are µ unknown and σ² known and µ and σ² unknown.

The Bayesian approach is a kind of compromise. One claims, for example, that parameters are never completely unknown; one always has some prior opinion or knowledge about them. A probabilistic model describing this approach was given in Example 3.1. The opening statement there was that the density of red blood corpuscles follows a Poisson distribution. One interpretation of that statement could have been that whenever we are faced with a blood sample the density of red blood corpuscles in the sample is Poissonian. The Bayesian approach taken in Example 3.1 is that whenever we know from whom the blood sample has been taken, the density of red blood corpuscles in the sample is Poissonian, however, with a parameter depending on the individual. If we do not know from whom the sample has been taken, then the parameter is unknown; it is a random variable following some distribution. We also found that if this distribution is the standard exponential, then the density of red blood corpuscles is geometric (and hence not Poissonian).

The prior knowledge about the parameters in this approach is expressed in such a way that the parameters are assumed to follow some probability distribution, called the prior (or a priori) distribution. If one wishes to assume that a parameter is completely unknown, one might solve the situation by attributing some uniform distribution to the parameter. In this terminology we may formulate our findings in Example 3.1 as follows: If the parameter in a Poisson distribution has a standard exponential prior distribution, then the random variable under consideration follows a Ge(1/2)-distribution.

Frequently, one performs random experiments in order to estimate (unknown) parameters. The estimates are based on observations from some probability distribution. The Bayesian analog is to determine the conditional distribution of the parameter given the result of the random experiment. Such a distribution is called the posterior (or a posteriori) distribution. Next we determine the posterior distribution in Example 3.1.

Example 4.1. The model in the example was

X | M = m ∼ Po(m) with M ∼ Exp(1).   (4.1)

We further had found that X ∼ Ge(1/2). Now we wish to determine the conditional distribution of M given the value of X. For x > 0, we have

F_{M|X=k}(x) = P(M ≤ x | X = k) = P({M ≤ x} ∩ {X = k}) / P(X = k) = ∫_0^x P(X = k | M = y) f_M(y) dy / P(X = k)
= ∫_0^x (e^{−y} y^k/k!) e^{−y} dy / (1/2)^{k+1} = ∫_0^x (1/Γ(k + 1)) y^k 2^{k+1} e^{−2y} dy,

which, after differentiation, yields

f_{M|X=k}(x) = (1/Γ(k + 1)) x^k 2^{k+1} e^{−2x},  x > 0.

Thus, M | X = k ∼ Γ(k + 1, 1/2) or, in our new terminology, the posterior distribution of M given that X equals k is Γ(k + 1, 1/2).

Remark 4.1. Note that, starting from the distribution of X given M (and from that of M), we have determined the distribution of M given X and that the solution of the problem, in fact, amounted to applying a continuous version of Bayes' formula.

Exercise 4.1. Check that E M and Var M are what they are supposed to be by applying Theorem 2.1 and Corollary 2.3.1 to the posterior distribution.
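The posterior Γ(k + 1, 1/2) can be illustrated by simulating the prior model and conditioning on the observed value of X. A sketch with k = 3 (k, sample size, and seed are my choices); in the text's Γ(p, a) parametrization the mean is pa and the variance pa²:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 10**6, 3
m = rng.exponential(1.0, n)                   # prior M ~ Exp(1)
x = rng.poisson(m)                            # X | M = m ~ Po(m)
post = m[x == k]                              # values of M given X = k

# Gamma(k+1, 1/2) has mean (k+1)/2 and variance (k+1)/4.
print(post.mean(), (k + 1) / 2)
print(post.var(), (k + 1) / 4)
```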

We conclude this section by studying coin tossing from the Bayesian point of view under the assumption that nothing is known about p = P(heads). Let X_n be the number of heads after n coin tosses. One possible model is

X_n | P = p ∼ Bin(n, p) with P ∼ U(0, 1).   (4.2)

The prior distribution of P, thus, is the U(0, 1)-distribution. Models of this kind are called mixed binomial models. For k = 0, 1, 2, ..., n, we now obtain (via some facts about the beta distribution)

P(X_n = k) = ∫_0^1 (n choose k) x^k (1 − x)^{n−k} · 1 dx = (n choose k) ∫_0^1 x^{(k+1)−1} (1 − x)^{(n+1−k)−1} dx
= (n choose k) Γ(k + 1)Γ(n + 1 − k)/Γ(n + 2) = [n!/(k!(n − k)!)] · [k!(n − k)!/(n + 1)!] = 1/(n + 1).

This means that X_n is uniformly distributed over the integers 0, 1, ..., n. A second thought reveals that this is a very reasonable conclusion. Since nothing is known about the coin (in the sense of relation (4.2)), there is nothing that favors a specific outcome, that is, all outcomes should be equally probable.

If p is known, we know that the results in different tosses are independent and that the probability of heads given that we obtained 10 heads in a row (still) equals p. What about these facts in the Bayesian model?

P(X_{n+1} = n + 1 | X_n = n) = P({X_{n+1} = n + 1} ∩ {X_n = n}) / P(X_n = n) = P(X_{n+1} = n + 1) / P(X_n = n) = (1/(n + 2)) / (1/(n + 1)) = (n + 1)/(n + 2) → 1 as n → ∞.

This means that if we know that there were many heads in a row then the (conditional) probability of another head is very large; the results in different tosses are not at all independent. Why is this the case? Let us find the posterior distribution of P.

P(P ≤ x | X_n = k) = ∫_0^x P(X_n = k | P = y) f_P(y) dy / P(X_n = k) = ∫_0^x (n choose k) y^k (1 − y)^{n−k} dy / (1/(n + 1)) = (n + 1) (n choose k) ∫_0^x y^k (1 − y)^{n−k} dy.

Differentiation yields

f_{P|X_n=k}(x) = (Γ(n + 2)/(Γ(k + 1)Γ(n + 1 − k))) x^k (1 − x)^{n−k},  0 < x < 1,

viz., a β(k + 1, n + 1 − k)-distribution. For k = n we obtain in particular (or, by direct computation)

f_{P|X_n=n}(x) = (n + 1) x^n,  0 < x < 1.

It follows that

P(P > 1 − ε | X_n = n) = 1 − (1 − ε)^{n+1} → 1 as n → ∞ for all ε > 0.

This means that if we know that there were many heads in a row then we also know that p is close to 1 and thus that it is very likely that the next toss will yield another head.

Remark 4.2. It is, of course, possible to consider the posterior distribution as a prior distribution for a further random experiment, and so on.
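Both the uniform distribution of X_n and the behavior of the posterior for k = n can be checked numerically. A sketch with n = 10 (n, ε, sample size, and seed are my choices):

```python
import numpy as np

rng = np.random.default_rng(7)
nsim, n = 10**6, 10
p = rng.uniform(0.0, 1.0, nsim)               # prior P ~ U(0, 1)
xn = rng.binomial(n, p)                       # X_n | P = p ~ Bin(n, p)

# Each value 0, 1, ..., n should have probability 1/(n + 1).
print(np.bincount(xn, minlength=n + 1) / nsim)

# Posterior given X_n = n is beta(n+1, 1): P(P > 1-eps | X_n = n) = 1 - (1-eps)^(n+1).
eps = 0.1
post = p[xn == n]
print((post > 1 - eps).mean(), 1 - (1 - eps) ** (n + 1))
```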

5 Regression and Prediction

A common statistics problem is to analyze how different (levels of) treatments or treatment combinations affect the outcome of an experiment. The yield of a crop, for example, may depend on variability in watering, fertilization, climate, and other factors in the various areas where the experiment is performed. One problem is that one cannot predict the outcome y exactly, meaning without error, even if the levels of the treatments x₁, x₂, ..., x_n are known exactly. An important function for predicting the outcome is the conditional expectation of the (random) outcome Y given the (random) levels of treatment X₁, X₂, ..., X_n.

Let X₁, X₂, ..., X_n and Y be jointly distributed random variables, and set

h(x) = h(x₁, ..., x_n) = E(Y | X₁ = x₁, ..., X_n = x_n) = E(Y | X = x).

Definition 5.1. The function h is called the regression function Y on X.

Remark 5.1. For n = 1 we have h(x) = E(Y | X = x), which is the ordinary conditional expectation.

Definition 5.2. A predictor (for Y) based on X is a function, d(X). The predictor is called linear if d is linear, that is, if d(X) = a₀ + a₁X₁ + ··· + a_nX_n, where a₀, a₁, ..., a_n are constants.

Predictors are used to predict (as the name suggests). The prediction error is given by the random variable

Y − d(X).   (5.1)

There are several ways to compare different predictors. One suitable measure is defined as follows:

Definition 5.3. The expected quadratic prediction error is E(Y − d(X))². Moreover, if d₁ and d₂ are predictors, we say that d₁ is better than d₂ if

E(Y − d₁(X))² ≤ E(Y − d₂(X))².

In the following we confine ourselves to considering the case n = 1. A predictor is thus a function of X, d(X), and the expected quadratic prediction error is E(Y − d(X))². If the predictor is linear, that is, if d(x) = a + bx, where a and b are constants, the expected quadratic prediction error is E(Y − (a + bX))².

Example 5.1. Pick a point uniformly distributed in the triangle 0 ≤ x, 0 ≤ y, x + y ≤ 1. We wish to determine the regression functions E(Y | X = x) and E(X | Y = y).

To solve this problem we first note that the joint density of X and Y is

f_{X,Y}(x, y) = c for 0 ≤ x, 0 ≤ y, x + y ≤ 1, and = 0 otherwise,

where c is some constant, which is found by noticing that the total mass equals 1. We thus have

1 = ∫∫ f_{X,Y}(x, y) dx dy = c ∫_0^1 ( ∫_0^{1−x} dy ) dx = c ∫_0^1 (1 − x) dx = c [−(1 − x)²/2]_0^1 = c/2,

from which it follows that c = 2. In order to determine the conditional densities we first compute the marginal ones:

f_X(x) = ∫ f_{X,Y}(x, y) dy = ∫_0^{1−x} 2 dy = 2(1 − x),  0 < x < 1,
f_Y(y) = ∫ f_{X,Y}(x, y) dx = ∫_0^{1−y} 2 dx = 2(1 − y),  0 < y < 1.

Incidentally, X and Y have the same distribution for reasons of symmetry. Finally,

f_{Y|X=x}(y) = f_{X,Y}(x, y)/f_X(x) = 2/(2(1 − x)) = 1/(1 − x),  0 < y < 1 − x,

and so

E(Y | X = x) = ∫_0^{1−x} y · (1/(1 − x)) dy = (1/(1 − x)) [y²/2]_0^{1−x} = (1 − x)/2,

and, by symmetry,

E(X | Y = y) = (1 − y)/2.

Remark 5.2. Note also, for example, that Y | X = x ∼ U(0, 1 − x) in the example, that is, the density is, for x fixed, a constant (which is the inverse of the length of the interval (0, 1 − x)). This implies that E(Y | X = x) = (1 − x)/2, which agrees with the previous results. It also provides an alternative solution to the last part of the problem. In this case the gain is marginal, but in a more technically complicated situation it might be more substantial.
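The regression function just obtained can be verified by simulating points in the triangle and averaging Y over narrow vertical strips. A sketch (the rejection-sampling construction, strip width, sample size, and seed are my choices):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 10**6
# Rejection sampling: keep points uniform in the triangle x, y >= 0, x + y <= 1.
u, v = rng.uniform(0, 1, (2, n))
keep = u + v <= 1
x, y = u[keep], v[keep]

for x0 in (0.2, 0.5, 0.8):
    near = np.abs(x - x0) < 0.01
    print(x0, y[near].mean(), (1 - x0) / 2)   # empirical vs (1 - x)/2
```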

Exercise 5.1. Solve the same problem when

f_{X,Y}(x, y) = cx for 0 < x, y < 1, and = 0 otherwise.

Exercise 5.2. Solve the same problem when

f_{X,Y}(x, y) = e^{−y} for 0 < x < y, and = 0 otherwise.

Theorem 5.1. Suppose that E Y² < ∞. Then h(X) = E(Y | X) (i.e., the regression function Y on X) is the best predictor of Y based on X.

Proof. By Theorem 2.3 we know that for an arbitrary predictor d(X),

E(Y − d(X))² = E Var(Y | X) + E(h(X) − d(X))² ≥ E Var(Y | X),

where equality holds iff d(X) = h(X) (more precisely, iff P(d(X) = h(X)) = 1). The choice d(X) = h(X) thus yields minimal expected quadratic prediction error.

Example 5.2. In Example 5.1 we found the regression function of Y based on X to be (1 − X)/2. By Theorem 5.1 it is the best predictor of Y based on X. A simple calculation shows that the expected quadratic prediction error is

E(Y − (1 − X)/2)² = E Var(Y | X) = E(1 − X)²/12 = 1/24.

We also noted that X and Y have the same marginal distribution. A (very) naive suggestion for another predictor therefore might be X itself. The expected quadratic prediction error for this predictor is E(Y − X)² = 1/6 > 1/24, which shows that the regression function is indeed a better predictor.
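A Monte Carlo comparison of the two predictors in Example 5.2 is immediate under the same sampling scheme as before; the numerical targets 1/24 and 1/6 are the values computed above (sample size and seed are my choices):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 10**6
u, v = rng.uniform(0, 1, (2, n))
keep = u + v <= 1
x, y = u[keep], v[keep]

mse_best = np.mean((y - (1 - x) / 2) ** 2)    # best predictor (1 - X)/2
mse_naive = np.mean((y - x) ** 2)             # naive predictor X
print(mse_best, 1 / 24)
print(mse_naive, 1 / 6)
```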

Sometimes it is difficult to determine regression functions explicitly. In such cases one might be satisfied with the best linear predictor. This means that one wishes to minimize E(Y − (a + bX))² as a function of a and b, which leads to the well-known method of least squares. The solution of this problem is given in the following result.

Theorem 5.2. Suppose that E X² < ∞ and E Y² < ∞. Set µ_x = E X, µ_y = E Y, σ_x² = Var X, σ_y² = Var Y, σ_xy = Cov(X, Y), and ρ = σ_xy/(σ_x σ_y). The best linear predictor of Y based on X is L(X) = α + βX, where

α = µ_y − (σ_xy/σ_x²) µ_x = µ_y − ρ (σ_y/σ_x) µ_x  and  β = σ_xy/σ_x² = ρ σ_y/σ_x.

The best linear predictor thus is

µ_y + ρ (σ_y/σ_x)(X − µ_x).   (5.2)

Definition 5.4. The line y = µ_y + ρ (σ_y/σ_x)(x − µ_x) is called the regression line Y on X. The slope, ρ σ_y/σ_x, of the line is called the regression coefficient.

Remark 5.3. Note that y = L(x), where L(X) is the best linear predictor of Y based on X.

Remark 5.4. If, in particular, (X, Y) has a joint Gaussian distribution, it turns out that the regression function is linear, that is, for this very important case the best linear predictor is, in fact, the best predictor. For details, we refer the reader to Section 5.6.

Example 5.1 (continued). The regression function Y on X turned out to be linear in this example; y = (1 − x)/2. It follows in particular that the regression function coincides with the regression line Y on X. The regression coefficient equals −1/2.

The expected quadratic prediction error of the best linear predictor of Y based on X is obtained as follows:

Theorem 5.3. E(Y − L(X))² = σ_y²(1 − ρ²).

Proof.

E(Y − L(X))² = E(Y − µ_y − ρ (σ_y/σ_x)(X − µ_x))²
= E(Y − µ_y)² + ρ² (σ_y²/σ_x²) E(X − µ_x)² − 2ρ (σ_y/σ_x) E(Y − µ_y)(X − µ_x)
= σ_y² + ρ² σ_y² − 2ρ (σ_y/σ_x) σ_xy = σ_y² + ρ² σ_y² − 2ρ² σ_y² = σ_y²(1 − ρ²).

Definition 5.5. The quantity σ_y²(1 − ρ²) is called the residual variance.

Exercise 5.3. Check via Theorem 5.3 that the residual variance in Example 5.1 equals 1/24 as was claimed in Example 5.2.

The regression line X on Y is determined similarly. It is

x = µ_x + ρ (σ_x/σ_y)(y − µ_y),

which can be rewritten as

y = µ_y + (1/ρ)(σ_y/σ_x)(x − µ_x)

if ρ ≠ 0. The regression lines Y on X and X on Y are thus, in general, different. They coincide iff they have the same slope, that is, iff

ρ σ_y/σ_x = (1/ρ) σ_y/σ_x  iff  ρ² = 1,

that is, iff there exists a linear relation between X and Y.

Example 5.1 (continued). The regression function X on Y was also linear (and coincides with the regression line X on Y). The line has the form x = (1 − y)/2, that is, y = 1 − 2x. In particular, we note that the slopes of the regression lines are −1/2 and −2, respectively.
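The formulas of Theorems 5.2 and 5.3 translate directly into sample moments. A sketch, again for Example 5.1 (the use of biased sample covariances, so that the plug-in formulas match exactly, as well as sample size and seed, are my choices):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 10**5
u, v = rng.uniform(0, 1, (2, n))
keep = u + v <= 1
x, y = u[keep], v[keep]

# Sample versions of beta = rho*sigma_y/sigma_x and alpha = mu_y - beta*mu_x.
beta = np.cov(x, y, bias=True)[0, 1] / np.var(x)
alpha = y.mean() - beta * x.mean()
rho = np.corrcoef(x, y)[0, 1]
print(alpha, beta)                    # ~ 1/2 and ~ -1/2: the line y = (1 - x)/2
print(np.var(y) * (1 - rho ** 2))     # residual variance, ~ 1/24
```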

6 Problems

1. Let X and Y be independent Exp(1)-distributed random variables. Find the conditional distribution of X given that X + Y = c (c is a positive constant).

2. Let X and Y be independent Γ(2, a)-distributed random variables. Find the conditional distribution of X given that X + Y = 2.

3. The life of a repairing device is Exp(1/a)-distributed. Peter wishes to use it on n different, independent, Exp(1/na)-distributed occasions.
(a) Compute the probability P_n that this is possible.
(b) Determine the limit of P_n as n → ∞.

4. The life T (hours) of the lightbulb in an overhead projector follows an Exp(1)-distribution. During a normal week it is used a Po(1)-distributed number of lectures lasting exactly one hour each. Find the probability that a projector with a newly installed lightbulb functions throughout a normal week (without replacing the lightbulb).

5. The random variables N, X₁, X₂, ... are independent, N ∼ Po(λ), and X_k ∼ Be(1/2), k ≥ 1. Set

Y₁ = Σ_{k=1}^{N} X_k and Y₂ = N − Y₁ (Y₁ = 0 for N = 0).

Show that Y₁ and Y₂ are independent, and determine their distributions.

6. Suppose that X ∼ N(0, 1) and Y ∼ Exp(1) are independent random variables. Prove that X·√(2Y) has a standard Laplace distribution.

7. Let N ∼ Ge(p) and set X = (−1)^N. Compute
(a) E X and Var X,
(b) the distribution (probability function) of X.

8. The density function of the two-dimensional random variable (X, Y) is

f_{X,Y}(x, y) = x y³ e^{−xy} for 0 < x < ∞, 0 < y < 1, and = 0 otherwise.

(a) Determine the distribution of Y.
(b) Find the conditional distribution of X given that Y = y.
(c) Use the results from (a) and (b) to compute E X and Var X.

9. The density of the random vector (X, Y) is

f_{X,Y}(x, y) = cx for x ≥ 0, y ≥ 0, x + y ≤ 1, and = 0 otherwise.

Compute
(a) c,
(b) the conditional expectations E(Y | X = x) and E(X | Y = y).

10. Suppose X and Y have a joint density function given by

f_{X,Y}(x, y) = cx for 0 < x < y < 1, and = 0 otherwise.

Find c, the marginal density functions, E X, E Y, and the conditional expectations E(Y | X = x) and E(X | Y = y).

11. Suppose X and Y have a joint density function given by

f_{X,Y}(x, y) = cx²y for 0 < y < x < 1, and = 0 otherwise.

Compute c, the marginal densities, E X, E Y, and the conditional expectations E(Y | X = x) and E(X | Y = y).

12. Let X and Y have joint density

f_{X,Y}(x, y) = cxy for 0 < y < x < 1, and = 0 otherwise.

Compute the conditional expectations E(Y | X = x) and E(X | Y = y).

13. Let X and Y have joint density

f_{X,Y}(x, y) = cy for 0 < y < x < 2, and = 0 otherwise.

Compute the conditional expectations E(Y | X = x) and E(X | Y = y).

14. Suppose that X and Y are random variables with joint density

f_{X,Y}(x, y) = c(x + y) for 0 < x < y < 1, and = 0 otherwise.

Compute the regression functions E(Y | X = x) and E(X | Y = y).

15. Suppose that X and Y are random variables with a joint density

f_{X,Y}(x, y) = (2/5)(2x + 3y) for 0 < x, y < 1, and = 0 otherwise.

Compute the conditional expectations E(Y | X = x) and E(X | Y = y).

16. Let X and Y be random variables with a joint density

f_{X,Y}(x, y) = (4/5)(x + 3y) e^{−x−2y} for x, y > 0, and = 0 otherwise.

Compute the regression functions E(Y | X = x) and E(X | Y = y).

17. Suppose that the joint density of X and Y is given by

f_{X,Y}(x, y) = x e^{−x−xy} for x > 0, y > 0, and = 0 otherwise.

Determine the regression functions E(Y | X = x) and E(X | Y = y).

18. Let the joint density function of X and Y be given by

f_{X,Y}(x, y) = c(x + y) for 0 < x < y < 1, and = 0 otherwise.

Determine c, the marginal densities, E X, E Y, and the conditional expectations E(Y | X = x) and E(X | Y = y).

19. Let the joint density of X and Y be given by

f_{X,Y}(x, y) = c for 0 ≤ x ≤ 1, x² ≤ y ≤ x, and = 0 otherwise.

Compute c, the marginal densities, and the conditional expectations E(Y | X = x) and E(X | Y = y).

20. Suppose that X and Y are random variables with joint density

f_{X,Y}(x, y) = cx for 0 < x < 1, x³ < y < x^{1/3}, and = 0 otherwise.

Compute the conditional expectations E(Y | X = x) and E(X | Y = y).

21. Suppose that X and Y are random variables with joint density

f_{X,Y}(x, y) = cy for 0 < x < 1, x⁴ < y < x^{1/4}, and = 0 otherwise.

Compute the conditional expectations E(Y | X = x) and E(X | Y = y).

22. Let the joint density function of X and Y be given by

f_{X,Y}(x, y) = cx³y for x, y > 0, x + y ≤ 1, and = 0 otherwise.

Compute c, the marginal densities, and the conditional expectations E(Y | X = x) and E(X | Y = y).

23. The joint density function of X and Y is given by

f_{X,Y}(x, y) = cxy for x, y > 0, 4x + y ≤ 1, and = 0 otherwise.

Compute c, the marginal densities, and the conditional expectations E(Y | X = x) and E(X | Y = y).

24. Let X and Y have joint density

f_{X,Y}(x, y) = c/(x³y) for 1 < y < x, and = 0 otherwise.

Compute the conditional expectations E(Y | X = x) and E(X | Y = y).

25. Let X and Y have joint density

f_{X,Y}(x, y) = c/(x⁴y) for 1 < y < x, and = 0 otherwise.

Compute the conditional expectations E(Y | X = x) and E(X | Y = y).

26. Suppose that X and Y are random variables with a joint density

f_{X,Y}(x, y) = c/(1 + x − y)² for 0 < y < x < 1, and = 0 otherwise.

Compute the conditional expectations E(Y | X = x) and E(X | Y = y).

27. Suppose that X and Y are random variables with a joint density

f_{X,Y}(x, y) = c cos x for 0 < y < x < π/2, and = 0 otherwise.

Compute the conditional expectations E(Y | X = x) and E(X | Y = y).

28. Let X and Y have joint density

f_{X,Y}(x, y) = c log(1/y) for 0 < y < x < 1, and = 0 otherwise.

Compute the conditional expectations E(Y | X = x) and E(X | Y = y).

29. The random vector (X, Y) has the following joint distribution:

P(X = m, Y = n) = (m choose n) (1/2)^m · m/15, where m = 1, 2, ..., 5 and n = 0, 1, ..., m.

Compute E(Y | X = m).

30. Show that a suitable power of a Weibull-distributed random variable whose parameter is gamma-distributed is Pareto-distributed. More precisely, show that if

X | A = a ∼ W(1/a, 1/b) with A ∼ Γ(p, θ),

then X^b has a (translated) Pareto distribution.

31. Show that an exponential random variable such that the inverse of the parameter is gamma-distributed is Pareto-distributed. More precisely, show that if

X | M = m ∼ Exp(m) with M^{−1} ∼ Γ(p, a),

then X has a (translated) Pareto distribution.

32. Let X and Y be random variables such that

Y | X = x ∼ Exp(1/x) with X ∼ Γ(2, 1).

(a) Show that Y has a translated Pareto distribution.
(b) Compute E Y.
(c) Check the value in (b) by recomputing it via our favorite formula for conditional means.

33. Suppose that the random variable X is uniformly distributed symmetrically around zero, but in such a way that the parameter is uniform on (0, 1); that is, suppose that

X | A = a ∼ U(−a, a) with A ∼ U(0, 1).

Find the distribution of X, E X, and Var X.

34. In Section 4 we studied the situation when a coin, such that p = P(head) is considered to be a U(0, 1)-distributed random variable, is tossed, and found (i.a.) that if X_n = # heads after n tosses, then X_n is uniformly distributed over the integers 0, 1, ..., n. Suppose instead that p is considered to be β(2, 2)-distributed. What then? More precisely, consider the following model:

X_n | Y = y ∼ Bin(n, y) with f_Y(y) = 6y(1 − y), 0 < y < 1.

(a) Compute E X_n and Var X_n.
(b) Determine the distribution of X_n.

35. Let X and Y be jointly distributed random variables such that

Y | X = x ∼ Bin(n, x) with X ∼ U(0, 1).

Compute E Y, Var Y, and Cov(X, Y) (without using what is known from Section 4 about the distribution of Y).

36. Let X and Y be jointly distributed random variables such that

Y | X = x ∼ Fs(x) with f_X(x) = 3x², 0 ≤ x ≤ 1.

Compute E Y, Var Y, Cov(X, Y), and the distribution of Y.

37. Let X be the number of coin tosses until heads is obtained. Suppose that the probability of heads is unknown in the sense that we consider it to be a random variable Y ∼ U(0, 1).
(a) Find the distribution of X (cf. Problem ).
(b) The expected value of an Fs-distributed random variable exists, as is well known. What about E X?
(c) Suppose that the value X = n has been observed. Find the posterior distribution of Y, that is, the distribution of Y | X = n.

38. Let p be the probability that the tip points downward after a person throws a drawing pin once. Annika throws a drawing pin until it points downward for the first time. Let X be the number of throws for this to happen. She then throws the drawing pin another X times. Let Y be the number of times the drawing pin points downward in the latter series of throws. Find the distribution of Y (cf. Problem ).

39. A point P is chosen uniformly in an n-dimensional sphere of radius 1. Next, a point Q is chosen uniformly within the concentric sphere, centered at the origin, going through P. Let X and Y be the distances of P and Q, respectively, to the common center. Find the joint density function of X and Y and the conditional expectations E(Y | X = x) and E(X | Y = y).

Hint 1. Begin by trying the case n = 2.
Hint 2. The volume of an n-dimensional sphere of radius r is equal to c_n r^n, where c_n is some constant (which is of no interest for the problem).
Remark. For n = 1 we rediscover the stick from Example 2.1.

40. Let X and Y be independent random variables. The conditional distribution of Y given that X = x then does not depend on x. Moreover, E(Y | X = x) is independent of x; recall Theorem 2.2(b) and Remark 2.4. Now, suppose instead that E(Y | X = x) is independent of x (i.e., that E(Y | X) = E Y). We say that Y has constant regression with respect to X. However, it does not necessarily follow that X and Y are independent. Namely, let the joint density of X and Y be given by

f_{X,Y}(x, y) = 1/π for x² + y² ≤ 1, and = 0 otherwise.

Show that Y has constant regression with respect to X and/but that X and Y are not independent.


More information

1 Presessional Probability

1 Presessional Probability 1 Presessional Probability Probability theory is essential for the development of mathematical models in finance, because of the randomness nature of price fluctuations in the markets. This presessional

More information

2. Variance and Covariance: We will now derive some classic properties of variance and covariance. Assume real-valued random variables X and Y.

2. Variance and Covariance: We will now derive some classic properties of variance and covariance. Assume real-valued random variables X and Y. CS450 Final Review Problems Fall 08 Solutions or worked answers provided Problems -6 are based on the midterm review Identical problems are marked recap] Please consult previous recitations and textbook

More information

Week 1 Quantitative Analysis of Financial Markets Distributions A

Week 1 Quantitative Analysis of Financial Markets Distributions A Week 1 Quantitative Analysis of Financial Markets Distributions A Christopher Ting http://www.mysmu.edu/faculty/christophert/ Christopher Ting : christopherting@smu.edu.sg : 6828 0364 : LKCSB 5036 October

More information

MATH 151, FINAL EXAM Winter Quarter, 21 March, 2014

MATH 151, FINAL EXAM Winter Quarter, 21 March, 2014 Time: 3 hours, 8:3-11:3 Instructions: MATH 151, FINAL EXAM Winter Quarter, 21 March, 214 (1) Write your name in blue-book provided and sign that you agree to abide by the honor code. (2) The exam consists

More information

University of Chicago Graduate School of Business. Business 41901: Probability Final Exam Solutions

University of Chicago Graduate School of Business. Business 41901: Probability Final Exam Solutions Name: University of Chicago Graduate School of Business Business 490: Probability Final Exam Solutions Special Notes:. This is a closed-book exam. You may use an 8 piece of paper for the formulas.. Throughout

More information

Multiple Random Variables

Multiple Random Variables Multiple Random Variables Joint Probability Density Let X and Y be two random variables. Their joint distribution function is F ( XY x, y) P X x Y y. F XY ( ) 1, < x

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 8 10/1/2008 CONTINUOUS RANDOM VARIABLES

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 8 10/1/2008 CONTINUOUS RANDOM VARIABLES MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 8 10/1/2008 CONTINUOUS RANDOM VARIABLES Contents 1. Continuous random variables 2. Examples 3. Expected values 4. Joint distributions

More information

6 The normal distribution, the central limit theorem and random samples

6 The normal distribution, the central limit theorem and random samples 6 The normal distribution, the central limit theorem and random samples 6.1 The normal distribution We mentioned the normal (or Gaussian) distribution in Chapter 4. It has density f X (x) = 1 σ 1 2π e

More information

Chapter 2. Some Basic Probability Concepts. 2.1 Experiments, Outcomes and Random Variables

Chapter 2. Some Basic Probability Concepts. 2.1 Experiments, Outcomes and Random Variables Chapter 2 Some Basic Probability Concepts 2.1 Experiments, Outcomes and Random Variables A random variable is a variable whose value is unknown until it is observed. The value of a random variable results

More information

X 1 ((, a]) = {ω Ω : X(ω) a} F, which leads us to the following definition:

X 1 ((, a]) = {ω Ω : X(ω) a} F, which leads us to the following definition: nna Janicka Probability Calculus 08/09 Lecture 4. Real-valued Random Variables We already know how to describe the results of a random experiment in terms of a formal mathematical construction, i.e. the

More information

18.440: Lecture 28 Lectures Review

18.440: Lecture 28 Lectures Review 18.440: Lecture 28 Lectures 17-27 Review Scott Sheffield MIT 1 Outline Continuous random variables Problems motivated by coin tossing Random variable properties 2 Outline Continuous random variables Problems

More information

n! (k 1)!(n k)! = F (X) U(0, 1). (x, y) = n(n 1) ( F (y) F (x) ) n 2

n! (k 1)!(n k)! = F (X) U(0, 1). (x, y) = n(n 1) ( F (y) F (x) ) n 2 Order statistics Ex. 4.1 (*. Let independent variables X 1,..., X n have U(0, 1 distribution. Show that for every x (0, 1, we have P ( X (1 < x 1 and P ( X (n > x 1 as n. Ex. 4.2 (**. By using induction

More information

MAT 271E Probability and Statistics

MAT 271E Probability and Statistics MAT 7E Probability and Statistics Spring 6 Instructor : Class Meets : Office Hours : Textbook : İlker Bayram EEB 3 ibayram@itu.edu.tr 3.3 6.3, Wednesday EEB 6.., Monday D. B. Bertsekas, J. N. Tsitsiklis,

More information

Physics 6720 Introduction to Statistics April 4, 2017

Physics 6720 Introduction to Statistics April 4, 2017 Physics 6720 Introduction to Statistics April 4, 2017 1 Statistics of Counting Often an experiment yields a result that can be classified according to a set of discrete events, giving rise to an integer

More information

February 26, 2017 COMPLETENESS AND THE LEHMANN-SCHEFFE THEOREM

February 26, 2017 COMPLETENESS AND THE LEHMANN-SCHEFFE THEOREM February 26, 2017 COMPLETENESS AND THE LEHMANN-SCHEFFE THEOREM Abstract. The Rao-Blacwell theorem told us how to improve an estimator. We will discuss conditions on when the Rao-Blacwellization of an estimator

More information

2. Suppose (X, Y ) is a pair of random variables uniformly distributed over the triangle with vertices (0, 0), (2, 0), (2, 1).

2. Suppose (X, Y ) is a pair of random variables uniformly distributed over the triangle with vertices (0, 0), (2, 0), (2, 1). Name M362K Final Exam Instructions: Show all of your work. You do not have to simplify your answers. No calculators allowed. There is a table of formulae on the last page. 1. Suppose X 1,..., X 1 are independent

More information

Lecture 1: August 28

Lecture 1: August 28 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 1: August 28 Our broad goal for the first few lectures is to try to understand the behaviour of sums of independent random

More information

Lecture 2: Review of Probability

Lecture 2: Review of Probability Lecture 2: Review of Probability Zheng Tian Contents 1 Random Variables and Probability Distributions 2 1.1 Defining probabilities and random variables..................... 2 1.2 Probability distributions................................

More information

Lecture 2 : CS6205 Advanced Modeling and Simulation

Lecture 2 : CS6205 Advanced Modeling and Simulation Lecture 2 : CS6205 Advanced Modeling and Simulation Lee Hwee Kuan 21 Aug. 2013 For the purpose of learning stochastic simulations for the first time. We shall only consider probabilities on finite discrete

More information

T has many other desirable properties, and we will return to this example

T has many other desirable properties, and we will return to this example 2. Introduction to statistics: first examples 2.1. Introduction. The basic problem of statistics is to draw conclusions about unknown distributions of random variables from observed values. These conclusions

More information

1 Exercises for lecture 1

1 Exercises for lecture 1 1 Exercises for lecture 1 Exercise 1 a) Show that if F is symmetric with respect to µ, and E( X )

More information

The Delta Method and Applications

The Delta Method and Applications Chapter 5 The Delta Method and Applications 5.1 Local linear approximations Suppose that a particular random sequence converges in distribution to a particular constant. The idea of using a first-order

More information

Exercises with solutions (Set D)

Exercises with solutions (Set D) Exercises with solutions Set D. A fair die is rolled at the same time as a fair coin is tossed. Let A be the number on the upper surface of the die and let B describe the outcome of the coin toss, where

More information

Probability Theory and Statistics. Peter Jochumzen

Probability Theory and Statistics. Peter Jochumzen Probability Theory and Statistics Peter Jochumzen April 18, 2016 Contents 1 Probability Theory And Statistics 3 1.1 Experiment, Outcome and Event................................ 3 1.2 Probability............................................

More information

Lecture 13 and 14: Bayesian estimation theory

Lecture 13 and 14: Bayesian estimation theory 1 Lecture 13 and 14: Bayesian estimation theory Spring 2012 - EE 194 Networked estimation and control (Prof. Khan) March 26 2012 I. BAYESIAN ESTIMATORS Mother Nature conducts a random experiment that generates

More information

Week 9 The Central Limit Theorem and Estimation Concepts

Week 9 The Central Limit Theorem and Estimation Concepts Week 9 and Estimation Concepts Week 9 and Estimation Concepts Week 9 Objectives 1 The Law of Large Numbers and the concept of consistency of averages are introduced. The condition of existence of the population

More information

1 Random variables and distributions

1 Random variables and distributions Random variables and distributions In this chapter we consider real valued functions, called random variables, defined on the sample space. X : S R X The set of possible values of X is denoted by the set

More information

Sample Spaces, Random Variables

Sample Spaces, Random Variables Sample Spaces, Random Variables Moulinath Banerjee University of Michigan August 3, 22 Probabilities In talking about probabilities, the fundamental object is Ω, the sample space. (elements) in Ω are denoted

More information

3. Probability and Statistics

3. Probability and Statistics FE661 - Statistical Methods for Financial Engineering 3. Probability and Statistics Jitkomut Songsiri definitions, probability measures conditional expectations correlation and covariance some important

More information

ECE 4400:693 - Information Theory

ECE 4400:693 - Information Theory ECE 4400:693 - Information Theory Dr. Nghi Tran Lecture 8: Differential Entropy Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 1 / 43 Outline 1 Review: Entropy of discrete RVs 2 Differential

More information

The expected value E[X] of discrete random variable X is defined by. xp X (x), (6.1) E[X] =

The expected value E[X] of discrete random variable X is defined by. xp X (x), (6.1) E[X] = Chapter 6 Meeting Expectations When a large collection of data is gathered, one is typically interested not necessarily in every individual data point, but rather in certain descriptive quantities such

More information

Review of Probability Theory

Review of Probability Theory Review of Probability Theory Arian Maleki and Tom Do Stanford University Probability theory is the study of uncertainty Through this class, we will be relying on concepts from probability theory for deriving

More information

Discrete Mathematics and Probability Theory Fall 2015 Note 20. A Brief Introduction to Continuous Probability

Discrete Mathematics and Probability Theory Fall 2015 Note 20. A Brief Introduction to Continuous Probability CS 7 Discrete Mathematics and Probability Theory Fall 215 Note 2 A Brief Introduction to Continuous Probability Up to now we have focused exclusively on discrete probability spaces Ω, where the number

More information

5. Conditional Distributions

5. Conditional Distributions 1 of 12 7/16/2009 5:36 AM Virtual Laboratories > 3. Distributions > 1 2 3 4 5 6 7 8 5. Conditional Distributions Basic Theory As usual, we start with a random experiment with probability measure P on an

More information

1 Basic continuous random variable problems

1 Basic continuous random variable problems Name M362K Final Here are problems concerning material from Chapters 5 and 6. To review the other chapters, look over previous practice sheets for the two exams, previous quizzes, previous homeworks and

More information

Chapter 2 Random Variables

Chapter 2 Random Variables Stochastic Processes Chapter 2 Random Variables Prof. Jernan Juang Dept. of Engineering Science National Cheng Kung University Prof. Chun-Hung Liu Dept. of Electrical and Computer Eng. National Chiao Tung

More information

Gaussian random variables inr n

Gaussian random variables inr n Gaussian vectors Lecture 5 Gaussian random variables inr n One-dimensional case One-dimensional Gaussian density with mean and standard deviation (called N, ): fx x exp. Proposition If X N,, then ax b

More information

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan Monte-Carlo MMD-MA, Université Paris-Dauphine Xiaolu Tan tan@ceremade.dauphine.fr Septembre 2015 Contents 1 Introduction 1 1.1 The principle.................................. 1 1.2 The error analysis

More information

Review of Statistics

Review of Statistics Review of Statistics Topics Descriptive Statistics Mean, Variance Probability Union event, joint event Random Variables Discrete and Continuous Distributions, Moments Two Random Variables Covariance and

More information

ECE 313: Conflict Final Exam Tuesday, May 13, 2014, 7:00 p.m. 10:00 p.m. Room 241 Everitt Lab

ECE 313: Conflict Final Exam Tuesday, May 13, 2014, 7:00 p.m. 10:00 p.m. Room 241 Everitt Lab University of Illinois Spring 1 ECE 313: Conflict Final Exam Tuesday, May 13, 1, 7: p.m. 1: p.m. Room 1 Everitt Lab 1. [18 points] Consider an experiment in which a fair coin is repeatedly tossed every

More information

For a stochastic process {Y t : t = 0, ±1, ±2, ±3, }, the mean function is defined by (2.2.1) ± 2..., γ t,

For a stochastic process {Y t : t = 0, ±1, ±2, ±3, }, the mean function is defined by (2.2.1) ± 2..., γ t, CHAPTER 2 FUNDAMENTAL CONCEPTS This chapter describes the fundamental concepts in the theory of time series models. In particular, we introduce the concepts of stochastic processes, mean and covariance

More information

Introduction to Probability and Statistics (Continued)

Introduction to Probability and Statistics (Continued) Introduction to Probability and Statistics (Continued) Prof. icholas Zabaras Center for Informatics and Computational Science https://cics.nd.edu/ University of otre Dame otre Dame, Indiana, USA Email:

More information