On Least Squares Linear Regression Without Second Moment

BY RAJESHWARI MAJUMDAR

University of Connecticut

If $X$ and $Y$ are real valued random variables such that the first moments of $X$, $Y$, and $XY$ exist and the conditional expectation of $Y$ given $X$ is an affine function of $X$, then the intercept and slope of the conditional expectation equal the intercept and slope of the least squares linear regression function, even though $Y$ may not have a finite second moment. As a consequence, the affine in $X$ form of the conditional expectation and zero covariance imply mean independence.

1 Introduction

If $X$ and $Y$ are real valued random variables such that the conditional expectation of $Y$ given $X$ is an affine function of $X$, then the intercept and slope of the conditional expectation equal, respectively, the intercept and slope of the least squares linear regression function. As explained in Remark 1, when both $X$ and $Y$ have finite second moments, this equality follows from the well understood connection among conditional expectation, least squares linear regression, and the operation of projection. However, that this equality continues to hold when one only assumes that the first moments of $X$, $Y$, and $XY$ exist is the most important finding of this note [Theorem 1].

Theorem 1 evolves from an investigation of the directional hierarchy of the interdependence among the notions of independence, mean independence, and zero covariance. It is well known that, for random variables $X$ and $Y$, independence implies mean independence and mean independence implies zero covariance, whenever the notions of mean independence and covariance make sense. Note that the notion of covariance makes sense as long as the first moments of $X$, $Y$, and $XY$ exist; it does not require $X$ and $Y$ to have finite second moments. We review well-known counterexamples to document that the direction of this hierarchy cannot be reversed in general.
Theorem 1, above and beyond establishing that an affine in $X$ form of the conditional expectation of $Y$ given $X$ implies its equality with the least squares linear regression function, also leads to the conclusion that mean independence is necessary and sufficient for zero covariance when the conditional expectation is affine in $X$. Remark 2 examines the feasibility of obtaining the result of Theorem 1 by using the projection operator interpretation of conditional expectation and least squares linear regression elucidated in Remark 1 and the technique of extending operators to the closure of their domains, and concludes in the negative. Remark 3 explains why the relaxation of the assumption of $Y$ having a finite second moment is non-trivial.

MSC 2010 subject classifications: 62G08, 62J05
Keywords and phrases: Independence, mean independence, zero covariance, projection.

The following notational conventions are used throughout the note. Equality (or inequality) involving measurable functions defined on a probability space, unless otherwise indicated, indicates that the relation holds almost surely. Sets are always identified with their indicator functions, and $\emptyset$ denotes the universal null set. The normal distribution with mean $\mu$ and variance $\sigma^2$ is denoted by $N(\mu, \sigma^2)$. In what follows, we repeatedly use the averaging property, the pull-out property, and the chain rule of conditional expectation [Kallenberg (2002, page 105)].

2 Results

Let $X$ and $Y$ be real valued random variables on a probability space $(\Omega, \mathcal{F}, P)$. Let $\mathrm{E}$ denote the expectation induced by $P$ and $\mathrm{E}^{\mathcal{G}}$ the conditional expectation given a sub-$\sigma$-algebra $\mathcal{G}$ of $\mathcal{F}$. If $\mathcal{G} = \sigma(U)$ for some random variable $U$ defined on $(\Omega, \mathcal{F}, P)$, we write $\mathrm{E}^{U}$ in place of $\mathrm{E}^{\mathcal{G}}$. Let $\mathcal{L}_1$ (respectively, $\mathcal{L}_2$) denote the Banach (respectively, Hilbert) space of integrable (respectively, square integrable) real valued functions on $(\Omega, \mathcal{F}, P)$.

If $X, Y, XY \in \mathcal{L}_1$, and $X$ and $Y$ are independent, then
$$\mathrm{Cov}(X, Y) = 0. \qquad (1)$$
As is well known, the reverse implication is not true in general; see Example 1.

Example 1. Let $X \sim N(0, 1)$ be independent of the Rademacher random variable $W$, defined by $P(W = 1) = P(W = -1) = 1/2$. Let $Y = WX$; then $Y \sim N(0, 1)$ and (1) holds. However, $X$ and $Y$ are not independent; because if they were, then $X + Y$ would have the $N(0, 2)$ distribution, implying $P(X + Y = 0) = 0$, whereas in actuality $P(X + Y = 0) = 1/2$. //

Definition 1. If $Y \in \mathcal{L}_1$ and
$$\mathrm{E}^{X} Y = \mathrm{E} Y, \qquad (2)$$
$Y$ is said to be mean independent of $X$. Similarly, if $X \in \mathcal{L}_1$ and
$$\mathrm{E}^{Y} X = \mathrm{E} X, \qquad (3)$$
$X$ is said to be mean independent of $Y$. Example 2 shows that $Y$ can be mean independent of $X$ without $X$ being mean independent of $Y$.

Example 2. Consider the equi-probable discrete sample space $\Omega = \{-1, 0, 1\}$ and define random variables $X$ and $Y$ on $\Omega$ as $X(\omega) = 1\{\omega = 0\}$ and $Y(\omega) = \omega$.
Since $\sigma(X) = \{\emptyset, \{0\}, \{-1, 1\}, \Omega\}$ and $\mathrm{E}(Y A)$ is trivially equal to $0$ for all $A \in \sigma(X)$, we obtain $\mathrm{E}^{X} Y = 0 = \mathrm{E} Y$. However, since $X(\omega) = 1\{Y(\omega) = 0\}$, $X$ is $\sigma(Y)$ measurable, and consequently $\mathrm{E}^{Y} X = X$, whereas $\mathrm{E} X = 1/3$. //

It follows from the definition of independence and conditional expectation [Dudley (1989, page 264)] that for $X$ and $Y$ independent, (2) holds if $Y \in \mathcal{L}_1$, whereas (3) holds if $X \in \mathcal{L}_1$. The asymmetric nature of the notion of mean independence established in Example 2 shows that (2) (or, for that matter, (3)) does not imply independence of $X$ and $Y$. Example 3 extends Example 1 to show that even (2) and (3) combined do not necessarily imply independence of $X$ and $Y$.

Example 3. Let $X$, $W$, and $Y$ be as in Example 1. Since $X$ and $W$ are independent and $X \sim N(0, 1)$, by Corollary 7.1.2 of Chow and Teicher (1988), for any Borel set $B$,
$$\mathrm{E}(X \, 1_B(Y)) = \tfrac{1}{2} \mathrm{E}(X \, 1_B(X)) + \tfrac{1}{2} \mathrm{E}(X \, 1_B(-X)) = 0,$$
the two terms cancelling because $-X$ has the same distribution as $X$; this implies $\mathrm{E}^{Y} X = 0 = \mathrm{E} X$. Clearly, by the pull-out property, $\mathrm{E}^{X} Y = \mathrm{E}^{X}(WX) = X \mathrm{E}^{X} W = X \mathrm{E} W = 0 = \mathrm{E} Y$. However, as observed in Example 1, $X$ and $Y$ are not independent. //

As observed in the introduction, that mean independence implies zero covariance is well known. Example 4 shows that zero covariance, that is (1), does not necessarily imply mean independence as in (2).

Example 4. Let $X$ be uniformly distributed over the interval $(-1, 1)$ and $Y = X^2$. Since $\mathrm{E} X = 0 = \mathrm{E} X^3 = \mathrm{E}(XY)$, (1) holds. Since $\mathrm{E} Y = \mathrm{E} X^2 = \mathrm{Var}(X) = 1/3$ and $\mathrm{E}^{X} Y = \mathrm{E}^{X} X^2 = X^2$, (2) does not hold. //

Theorem 1 shows that if $Y$ is mean independent of $X$ (as in (2)), then (1) holds; it also characterizes the setup wherein the reverse implication holds.

Theorem 1. Assume $X, Y, XY \in \mathcal{L}_1$. Then (2) implies (1) and
$$\mathrm{E}^{X} Y = \alpha + \beta X \qquad (4)$$
for some $\alpha, \beta \in \mathbb{R}$. Conversely, (4) implies
$$\alpha = \mathrm{E} Y - \beta \mathrm{E} X \qquad (5)$$
and
$$\beta = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)} \, 1\{\mathrm{Var}(X) > 0\}, \qquad (6)$$
where the ratio is interpreted as $0$ when $\mathrm{Var}(X) = \infty$; further, (6), in conjunction with (1), implies (2).
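Theorem 1 lends itself to a numerical illustration. The sketch below is ours, not part of the paper: the values $\alpha = 2$, $\beta = 3$ and the Student-$t$ noise with $1.5$ degrees of freedom are illustrative assumptions, chosen so that $Y$ has a first but not a second moment while $X$ and $XY$ remain integrable, exactly the regime of the theorem. The empirical least squares intercept and slope then recover $\alpha$ and $\beta$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Illustrative setup (our choice): E[Y | X] = alpha + beta * X, with noise
# that has a mean but infinite variance, so Y is in L1 \ L2 while X and XY
# are integrable -- the moment assumptions of Theorem 1.
alpha, beta = 2.0, 3.0
X = rng.standard_normal(n)
eps = rng.standard_t(df=1.5, size=n)   # E[eps] = 0, Var(eps) = infinity
Y = alpha + beta * X + eps

# Least squares linear regression coefficients, as in (5) and (6)
beta_hat = np.cov(X, Y, ddof=0)[0, 1] / np.var(X)
alpha_hat = Y.mean() - beta_hat * X.mean()
print(alpha_hat, beta_hat)  # empirically close to (2.0, 3.0)
```

Despite the infinite variance of $Y$, the sample covariance converges (slowly) because $XY$ is integrable, which is the point of the theorem's relaxed moment assumptions.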
Clearly, if $X$ is mean independent of $Y$ as in (3), then (1) holds as well. Also, going back to Example 4, we now know why (1) does not imply (2) in that context; since $\mathrm{E}^{X} Y = X^2$, (4) does not hold, and by Theorem 1, (2) cannot hold.

Proof of Theorem 1. By the averaging and pull-out properties,
$$\mathrm{E}(XY) = \mathrm{E}\big(\mathrm{E}^{X}(XY)\big) = \mathrm{E}\big(X \, \mathrm{E}^{X} Y\big). \qquad (7)$$
Also, since $X \in \mathcal{L}_1$, $\mathrm{Var}(X)$ is well defined, though it may be $\infty$.

Assume (2) holds. Then (1) follows from (7); also, (4) holds with $\alpha = \mathrm{E} Y$ and $\beta = 0$.

Conversely, assume (4) holds. Taking expectations of both sides,
$$\mathrm{E} Y = \alpha + \beta \mathrm{E} X, \qquad (8)$$
whence (5) follows. Thus it remains to show (6). Note that once we show (6), (1) implies $\beta = 0$, which, via (4) and (8), implies (2).

Now note that (8) implies
$$\mathrm{E} X \, \mathrm{E} Y = \alpha \mathrm{E} X + \beta (\mathrm{E} X)^2. \qquad (9)$$
By the averaging and pull-out properties, $\mathrm{E}^{X}(XY) = X \, \mathrm{E}^{X} Y$. Since $|X \, \mathrm{E}^{X} Y| \le \mathrm{E}^{X} |XY|$, we obtain by (4) that $X(\alpha + \beta X) = X \, \mathrm{E}^{X} Y \in \mathcal{L}_1$. Also, since $X \in \mathcal{L}_1$, $\alpha X \in \mathcal{L}_1$ for every $\alpha \in \mathbb{R}$; consequently,
$$\beta X^2 \in \mathcal{L}_1. \qquad (10)$$
If $X \notin \mathcal{L}_2$, that is, if $\mathrm{Var}(X) = \infty$, then the right-hand side of (6) is $0$ by definition and the left-hand side of (6) is $0$ by (10). If $X \in \mathcal{L}_2$, that is, if $\mathrm{Var}(X) < \infty$, by (7) and (4),
$$\mathrm{E}(XY) = \mathrm{E}\big(X(\alpha + \beta X)\big) = \alpha \mathrm{E} X + \beta \mathrm{E} X^2. \qquad (11)$$
Subtracting (9) from (11), $\mathrm{Cov}(X, Y) = \beta \mathrm{Var}(X)$. Multiplying both sides by $\kappa$, where
$$\kappa = \begin{cases} 0 & \text{if } \mathrm{Var}(X) = 0 \\ 1/\mathrm{Var}(X) & \text{if } \mathrm{Var}(X) > 0, \end{cases}$$
we obtain
$$\frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)} \, 1\{\mathrm{Var}(X) > 0\} = \beta \, 1\{\mathrm{Var}(X) > 0\}.$$
Now note that (6) will follow once we show that $\beta \, 1\{\mathrm{Var}(X) = 0\} = 0$. Since, by the definition of variance, $X \, 1\{\mathrm{Var}(X) = 0\} = \mathrm{E} X \, 1\{\mathrm{Var}(X) = 0\}$, by (8),
$$(\alpha + \beta X) \, 1\{\mathrm{Var}(X) = 0\} = \mathrm{E} Y \, 1\{\mathrm{Var}(X) = 0\};$$
in other words, an affine function of $X$ with slope $\beta \, 1\{\mathrm{Var}(X) = 0\}$ is identically equal to a constant, showing that the slope is equal to $0$, thereby completing the proof. //

Remark 1. Is there an element of surprise in the conclusion of Theorem 1 which asserts that (4) implies (5) and (6)? The answer is no when $X, Y \in \mathcal{L}_2$, since in that case it follows from the well-understood connection (outlined below) between conditional expectation, least squares linear regression, and the operation of projection.

For a fixed real valued measurable function $X$ on $(\Omega, \mathcal{F}, P)$, let $\mathcal{L}_2(X)$ denote the Hilbert space of real valued measurable functions $f$ on $(\Omega, \sigma(X), P)$ such that $\mathrm{E} f^2 < \infty$. Clearly, $\mathcal{L}_2(X) \subset \mathcal{L}_2$. Let $\|\cdot\|$ denote the $\mathcal{L}_2$ norm on $\mathcal{L}_2$, and by inheritance, on $\mathcal{L}_2(X)$. Define
$$\mathfrak{M} = \{f \in \mathcal{L}_2 : f = g \text{ outside of a } P\text{-null set in } \mathcal{F} \text{ for some } g \in \mathcal{L}_2(X)\}.$$
Clearly, $\mathfrak{M}$ is a subspace of $\mathcal{L}_2$ that contains $\mathcal{L}_2(X)$. If $(f_n : n \ge 1)$ is a sequence in $\mathfrak{M}$ that converges to $f \in \mathcal{L}_2$, then, since every convergent sequence is Cauchy, $\|f_n - f_m\| \to 0$ as $m, n \to \infty$. By definition of $\mathfrak{M}$, there exists a sequence $(g_n : n \ge 1)$ in $\mathcal{L}_2(X)$ such that $\|f_n - f_m\| = \|g_n - g_m\|$, implying that $(g_n : n \ge 1)$ is a Cauchy sequence in $\mathcal{L}_2(X)$. Since $\mathcal{L}_2(X)$ is complete, there exists $g \in \mathcal{L}_2(X)$ such that $\|g_n - g\| \to 0$ as $n \to \infty$. Since $\|f_n - g_n\| = 0$ for every $n \ge 1$, by the triangle inequality in $\mathcal{L}_2$, $\|f - g\| = 0$, showing that $f \in \mathfrak{M}$, that is, $\mathfrak{M}$ is closed in $\mathcal{L}_2$.

Since $(\mathrm{E}^{X} Y)^2 \le \mathrm{E}^{X} Y^2$ by the Conditional Jensen's Inequality [Dudley (1989, Theorem 10.2.7)], $\mathrm{E}^{X} Y \in \mathcal{L}_2(X)$ by the averaging property. Let $T$ denote the map from $\mathcal{L}_2$ to $\mathcal{L}_2(X)$ that takes $Y \in \mathcal{L}_2$ to $\mathrm{E}^{X} Y \in \mathcal{L}_2(X)$. Note that $T$ depends on $X$, but we suppress that dependence for notational convenience. Clearly, $T$ is linear. Since $(\mathrm{E}^{X} Y)^2 \le \mathrm{E}^{X} Y^2$, $\|TY\| \le \|Y\|$ by the averaging property. Thus, $T$ is a contraction operator. For any $Y \in \mathcal{L}_2$ and $g \in \mathcal{L}_2(X)$, using the pull-out property, $\langle Y - TY, TY - g \rangle = 0$, where $\langle \cdot \,, \cdot \rangle$ denotes the inner product in $\mathcal{L}_2$, and consequently,
$$\|Y - g\|^2 = \|Y - TY\|^2 + \|TY - g\|^2,$$
showing that $TY$ is the unique minimizer of $\|Y - g\|$ as $g$ varies over $\mathcal{L}_2(X)$.
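The least squares characterization of conditional expectation is easy to see concretely on a finite sample space. The following sketch is ours, for illustration only: it revisits Example 2 with numpy, computing $\mathrm{E}^{X} Y$ as the ordinary least squares projection of $Y$ onto the span of the indicators of $\{X = 0\}$ and $\{X = 1\}$, which spans the $\sigma(X)$-measurable functions under the uniform probability.

```python
import numpy as np

# Sample space of Example 2: equi-probable Omega = {-1, 0, 1},
# X = indicator of {omega = 0}, Y(omega) = omega.
omega = np.array([-1.0, 0.0, 1.0])
X = (omega == 0.0).astype(float)
Y = omega.copy()

# sigma(X)-measurable functions are spanned by the indicators of {X = 0}
# and {X = 1}; with uniform weights, E[Y | X] is the ordinary least
# squares projection of Y onto that span.
B = np.column_stack([(X == 0.0).astype(float), (X == 1.0).astype(float)])
coef, *_ = np.linalg.lstsq(B, Y, rcond=None)
cond_exp_Y_given_X = B @ coef          # E[Y | X], pointwise on Omega
print(np.allclose(cond_exp_Y_given_X, 0.0))  # True: E[Y | X] = 0 = E[Y]

# By contrast, Y is injective, so every function on Omega is
# sigma(Y)-measurable and E[X | Y] = X, whereas E[X] = 1/3:
# X is not mean independent of Y.
```

The projection of $Y$ onto each indicator column is just the average of $Y$ over the corresponding atom of $\sigma(X)$, which is the elementary description of conditional expectation on a finite space.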
Recall that $\mathfrak{M}$ is a closed subspace of $\mathcal{L}_2$; let the orthogonal projection from $\mathcal{L}_2$ to $\mathfrak{M}$ be denoted by $\pi_{\mathfrak{M}}$. Then, for any $Y \in \mathcal{L}_2$ and $f \in \mathfrak{M}$,
$$\|Y - f\|^2 = \|Y - \pi_{\mathfrak{M}} Y\|^2 + \|\pi_{\mathfrak{M}} Y - f\|^2,$$
showing that $\pi_{\mathfrak{M}} Y$ is the unique minimizer of $\|Y - f\|$ as $f$ varies over $\mathfrak{M}$. Since
$$\inf\{\|Y - g\| : g \in \mathcal{L}_2(X)\} = \inf\{\|Y - f\| : f \in \mathfrak{M}\},$$
$\pi_{\mathfrak{M}} Y$ equals $TY$ outside of a $P$-null set in $\mathcal{F}$. Note that if the probability space $(\Omega, \sigma(X), P)$ is complete, so that the almost sure limit of a sequence of measurable functions is measurable, $\mathcal{L}_2(X)$ becomes a closed subspace of $\mathcal{L}_2$, and in our identification of the conditional expectation as a projection, we can avoid the construction involving $\mathfrak{M}$.

Let $\mathcal{W}$ denote the two-dimensional linear space spanned by $\mathbf{1}$ and $X$, where $\mathbf{1}$ is the real valued function defined on $(\Omega, \mathcal{F}, P)$ that is almost surely equal to $1$; since $X \in \mathcal{L}_2$, we obtain $\mathcal{W} \subset \mathcal{L}_2(X) \subset \mathfrak{M}$. Putting the degenerate case when $X$ is a scalar multiple of $\mathbf{1}$ aside for the time being and applying the Gram-Schmidt orthonormalization process to the basis $\{\mathbf{1}, X\}$ of $\mathcal{W}$, we obtain that $\{\mathbf{1}, \tilde{X}\}$, where
$$\tilde{X} = \frac{X - \langle \mathbf{1}, X \rangle \mathbf{1}}{\|X - \langle \mathbf{1}, X \rangle \mathbf{1}\|},$$
is an orthonormal basis of $\mathcal{W}$. Let $\mathcal{P}$ denote the subspace of $\mathcal{L}_2$ that consists of all $Y \in \mathcal{L}_2$ such that $TY \in \mathcal{W}$, that is, (4) holds for $Y$. If $Y \in \mathcal{P}$, then $\pi_{\mathfrak{M}} Y = TY \in \mathcal{W} \subset \mathcal{L}_2(X) \subset \mathfrak{M}$, and consequently,
$$TY = \pi_{\mathfrak{M}} Y = \pi_{\mathcal{W}} \pi_{\mathfrak{M}} Y = \pi_{\mathcal{W}} Y = \langle \mathbf{1}, Y \rangle \mathbf{1} + \langle \tilde{X}, Y \rangle \tilde{X}, \qquad (12)$$
where $\pi_{\mathcal{W}}$ denotes the orthogonal projection from $\mathcal{L}_2$ to $\mathcal{W}$. In more familiar notation, (12) asserts
$$\mathrm{E}^{X} Y = \mathrm{E} Y + \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)} (X - \mathrm{E} X),$$
that is, (5) and (6) hold. Note that, if $X$ is a scalar multiple of $\mathbf{1}$, that is, $X$ is a degenerate random variable and $\mathrm{Var}(X) = 0$, then $\mathcal{W}$ is simply the span of $\mathbf{1}$, and we obtain that $TY = \langle \mathbf{1}, Y \rangle \mathbf{1}$, that is, $\mathrm{E}^{X} Y = \mathrm{E} Y$, leading to the conclusion of Theorem 1 for $X, Y \in \mathcal{L}_2$. //

Remark 2. Can the conclusion of Theorem 1, when we only have $Y \in \mathcal{L}_1 \setminus \mathcal{L}_2$, $X \in \mathcal{L}_1$, and $XY \in \mathcal{L}_1$, be obtained using the projection operator tools employed in Remark 1? The answer, as far as we can tell, is no.

The domain of the map $T$ can be expanded to $\mathcal{L}_1$, causing the range to be expanded to $\mathcal{L}_1(X)$, the Banach space of real valued measurable functions $f$ on $(\Omega, \sigma(X), P)$ such that $\mathrm{E}|f| < \infty$. The linearity of $T$ is not impacted by this expansion of domain. Since $|\mathrm{E}^{X} Y| \le \mathrm{E}^{X} |Y|$ (again by the Conditional Jensen's Inequality), $\|TY\|_1 \le \|Y\|_1$ by the averaging property, where $\|\cdot\|_1$ denotes the $\mathcal{L}_1$ norm on $\mathcal{L}_1$, and by inheritance, on $\mathcal{L}_1(X)$.
Thus, $T$ remains a contraction operator. In fact, as observed by Kallenberg (2002), $T$ on $\mathcal{L}_2$ is uniformly $\mathcal{L}_1$-continuous and its extension to a linear and continuous map on $\mathcal{L}_1$ is unique up to almost sure equivalence.

The definition of $\mathcal{P}$ can be extended to denote the subspace of $\mathcal{L}_1$ that consists of all $Y \in \mathcal{L}_1$ such that $TY \in \mathcal{W}$. Since $\mathcal{W}$ is a finite-dimensional, hence closed, subspace of $\mathcal{L}_1$, and $T$ is continuous, $\mathcal{P}$ is a closed subspace of $\mathcal{L}_1$. Let $\mathcal{Q}$ denote the subspace of $\mathcal{L}_1$ that consists of all $Y \in \mathcal{L}_1$ such that $XY \in \mathcal{L}_1$. If $X \in \mathcal{L}_2$, then, by the Cauchy-Schwartz inequality, $\mathcal{L}_2 \subset \mathcal{Q}$.

Theorem 1 asserts that the representation of $T$ as the orthogonal projection to $\mathcal{W}$ that is valid on $\mathcal{P} \cap \mathcal{L}_2$ can be extended to hold on $\mathcal{P} \cap \mathcal{Q}$. Given that the set inclusion relationship between the closure of $\mathcal{P} \cap \mathcal{L}_2$ in $\mathcal{L}_1$ and $\mathcal{P} \cap \mathcal{Q}$ is generally indeterminable (if $X$ is bounded, then $\mathcal{Q} = \mathcal{L}_1$ and $\mathcal{P} = \mathcal{P} \cap \mathcal{Q}$, implying that the closure of $\mathcal{P} \cap \mathcal{L}_2$ in $\mathcal{L}_1$ is contained in $\mathcal{P} \cap \mathcal{Q}$, though the inclusion cannot necessarily be reversed), it seems that the result of Theorem 1, even when $X$ is bounded, cannot be deduced using the technique of operator extension.

What happens if $X \in \mathcal{L}_1 \setminus \mathcal{L}_2$? Clearly, $\mathcal{W}$ is no longer a subspace of $\mathcal{L}_2$. However, since $\mathcal{L}_2$ is dense in $\mathcal{L}_1$, we can find a sequence $(X_n : n \ge 1)$ in $\mathcal{L}_2$ such that $X_n$ converges to $X$ in $\mathcal{L}_1$. With $\mathcal{W}_n$ denoting the span of $\mathbf{1}$ and $X_n$, the orthogonal projection of $\mathcal{L}_2$ to $\mathcal{W}_n$ converges to the orthogonal projection of $\mathcal{L}_2$ to the span of $\mathbf{1}$. From the proof of Theorem 1, $T$ on $\mathcal{P} \cap \mathcal{Q}$ equals the "orthogonal projection," in a limiting sense, to the span of $\mathbf{1}$. //

Remark 3. What does the relaxation of the structural assumption from $X, Y \in \mathcal{L}_2$ to $Y \in \mathcal{L}_1 \setminus \mathcal{L}_2$, $X \in \mathcal{L}_1$, and $XY \in \mathcal{L}_1$ entail? One can conceivably argue that the Cauchy-Schwartz inequality remains the primary tool for verifying that $XY \in \mathcal{L}_1$ when $X$ and $Y$ are dependent random variables, and as such, this relaxation of assumption is neither insightful nor useful. While that argument may have some merit, we would like to point out that if $X$ is bounded, then obviously $X \in \mathcal{L}_2$, and $Y \in \mathcal{L}_1 \setminus \mathcal{L}_2$ implies $XY \in \mathcal{L}_1$.
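The non-central t construction developed in the remainder of this remark can also be checked by simulation. The sketch below is ours; it takes the $\chi^2$ variable to have $2$ degrees of freedom (our reading of the example, consistent with the Gamma-function computation that follows), so that $\mathrm{E}^{X} Y = \sqrt{\pi}\, X$ and, by Theorem 1, the least squares slope of $Y$ on $X$ equals $\sqrt{\pi}$ even though $Y$ has no second moment.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# The construction of Remark 3 (as we read it): X ~ N(0,1), W | X ~ N(X,1),
# V ~ chi-squared with 2 degrees of freedom independent of (W, X), and
# Y = W / sqrt(V/2).  Then E[Y | X] = sqrt(pi) * X, so by Theorem 1 the
# least squares slope of Y on X equals sqrt(pi), although Y is not in L2.
X = rng.standard_normal(n)
W = X + rng.standard_normal(n)
V = rng.chisquare(df=2, size=n)
Y = W / np.sqrt(V / 2.0)

beta_hat = np.cov(X, Y, ddof=0)[0, 1] / np.var(X)
print(beta_hat)  # empirically close to sqrt(pi) ~ 1.7725
```

Convergence of the sample covariance is slow here, since $XY$ is integrable only barely (its tail index is $2$), which is a good reminder of how weak the moment assumptions of Theorem 1 are.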
A classic example of (4) holding for a bounded random variable $X$ is the Bernoulli random variable, since, for any measurable function $h(X)$ of the Bernoulli random variable $X$, we have $h(X) = h(0) + (h(1) - h(0))X$, and $\mathrm{E}^{X} Y$ is a measurable function of $X$. Working with the non-central t-distribution with $2$ degrees of freedom, the following example presents $(X, Y)$ such that $X \in \mathcal{L}_2$ but is not bounded, $Y \in \mathcal{L}_1 \setminus \mathcal{L}_2$, $XY \in \mathcal{L}_1$, and (4) holds.

Let $X \sim N(0, 1)$ and, given $X$, let $W \sim N(X, 1)$. Let $V \sim \chi^2_2$ be independent of $(W, X)$, and let $Y = W / \sqrt{V/2}$. Clearly, $X \in \mathcal{L}_2$ but is not bounded. To verify that $Y \in \mathcal{L}_1 \setminus \mathcal{L}_2$, we first obtain the marginal distribution of $W$. Using the form of the conditional density of $W$ given $X = x$ and the form of the marginal density of $X$, we obtain that the joint density of $(W, X)$ is given by
$$f(w, x) = \frac{1}{2\pi} \exp\left(-\frac{w^2 - 2wx + 2x^2}{2}\right);$$
since $x \mapsto \pi^{-1/2} \exp(-(x - w/2)^2)$ represents the density of the $N(w/2, 1/2)$ distribution, completing the square in $x$ we obtain the marginal density of $W$ to be
$$f_W(w) = \int_{\mathbb{R}} f(w, x) \, dx = \frac{\exp(-w^2/4)}{2\pi} \int_{\mathbb{R}} \exp\left(-(x - w/2)^2\right) dx = \frac{\exp(-w^2/4)}{2\sqrt{\pi}},$$
showing that $W \sim N(0, 2)$. Since $V = 2U$, where $U$ is a $\Gamma(1)$ random variable [Fabian and Hannan (1985, Definition 3.3.6)], by Theorem 3.3.9 of Fabian and Hannan (1985),
$$\mathrm{E}\left(\frac{1}{\sqrt{V/2}}\right) = \mathrm{E}\left(U^{-1/2}\right) = \frac{\Gamma(1/2)}{\Gamma(1)} = \sqrt{\pi}; \qquad (13)$$
since $W$ and $V$ are independent, $Y \in \mathcal{L}_1$. However, $\mathrm{E}(2/V) = \mathrm{E}(U^{-1}) = \infty$, showing that $Y \notin \mathcal{L}_2$.

To show that $XY \in \mathcal{L}_1$, by independence of $(W, X)$ and $V$ along with (13), it suffices to show that $XW \in \mathcal{L}_1$, which follows readily from the Cauchy-Schwartz inequality. Finally, by Corollary 7.1.2 of Chow and Teicher (1988) and (13), $\mathrm{E}^{(W, X)} Y = \sqrt{\pi}\, W$, implying, by the chain rule, $\mathrm{E}^{X} Y = \sqrt{\pi}\, \mathrm{E}^{X} W = \sqrt{\pi}\, X$, that is, (4) holds. //

References

Chow, Y. S. and Teicher, H. (1988). Probability Theory: Independence, Interchangeability, Martingales. 2nd Ed., Springer-Verlag, New York, NY.

Dudley, R. M. (1989). Real Analysis and Probability. 1st Ed., Wadsworth & Brooks/Cole, Pacific Grove, CA.

Fabian, V. and Hannan, J. (1985). Introduction to Probability and Mathematical Statistics. Wiley, New York, NY.

Kallenberg, O. (2002). Foundations of Modern Probability. 2nd Ed., Springer-Verlag, New York, NY.

RAJESHWARI MAJUMDAR
rajeshwari.majumdar@uconn.edu
PO Box 47
Coventry, CT 06238