On Least Squares Linear Regression Without Second Moment

BY RAJESHWARI MAJUMDAR
University of Connecticut

If X and Y are real valued random variables such that the first moments of X, Y, and XY exist and the conditional expectation of Y given X is an affine function of X, then the intercept and slope of the conditional expectation equal the intercept and slope of the least squares linear regression function, even though Y may not have a finite second moment. As a consequence, the affine in X form of the conditional expectation and zero covariance imply mean independence.

1 Introduction

If X and Y are real valued random variables such that the conditional expectation of Y given X is an affine function of X, then the intercept and slope of the conditional expectation equal, respectively, the intercept and slope of the least squares linear regression function. As explained in Remark 1, when both X and Y have finite second moments, this equality follows from the well understood connection among conditional expectation, least squares linear regression, and the operation of projection. That this equality continues to hold when one only assumes that the first moments of X, Y, and XY exist is the most important finding of this note [Theorem 1].

Theorem 1 evolves from an investigation of the directional hierarchy among the notions of independence, mean independence, and zero covariance. It is well known that, for random variables X and Y, independence implies mean independence and mean independence implies zero covariance, whenever the notions of mean independence and covariance make sense. Note that the notion of covariance makes sense as long as the first moments of X, Y, and XY exist; it does not require X and Y to have finite second moments. We review well-known counterexamples to document that the direction of this hierarchy cannot be reversed in general. Theorem 1, above and beyond establishing that an affine in X form of the conditional expectation of Y given X implies its equality with the least squares linear regression function, also leads to the conclusion that mean independence is necessary and sufficient for zero covariance when the conditional expectation is affine in X.

Remark 2 examines the feasibility of obtaining the result of Theorem 1 by using the projection operator interpretation of conditional expectation and least squares linear regression elucidated in Remark 1, together with the technique of extending operators to the closure of their domains, and concludes in the negative. Remark 3 explains why the relaxation of the assumption that Y has a finite second moment is non-trivial.

MSC 2010 subject classifications: 62G08, 62J05
Keywords and phrases: Independence, mean independence, zero covariance, projection.

The following notational conventions are used throughout the note. Equality (or inequality) between measurable functions defined on a probability space, unless otherwise indicated, means that the relation holds almost surely. Sets are always identified with their indicator functions, and ∅ denotes the universal null set. The normal distribution with mean μ and variance σ² is denoted by N(μ, σ²). In what follows, we repeatedly use the averaging property, the pull-out property, and the chain rule of conditional expectation [Kallenberg (2002, page 105)].

2 Results

Let X and Y be real valued random variables on a probability space (Ω, F, P). Let E denote the expectation induced by P and E^G the conditional expectation given a sub-σ-algebra G of F. If G = σ(U) for some random variable U defined on (Ω, F, P), we write E^U in place of E^σ(U). Let L^1 (respectively, L^2) denote the Banach (respectively, Hilbert) space of integrable (respectively, square integrable) real valued functions on (Ω, F, P).

If X, Y, XY ∈ L^1, and X and Y are independent, then

Cov(X, Y) = 0.   (1)

As is well known, the reverse implication is not true in general; see Example 1.

Example 1. Let X ~ N(0, 1) be independent of the Rademacher random variable W, defined by P(W = 1) = P(W = −1) = 1/2. Let Y = WX; then Y ~ N(0, 1) and (1) holds. However, X and Y are not independent: if they were, then X + Y would have the N(0, 2) distribution, implying P(X + Y = 0) = 0, whereas in actuality P(X + Y = 0) = P(W = −1) = 1/2. //
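The construction in Example 1 is easy to probe numerically. The following sketch is our addition, not part of the original note; it assumes NumPy and a fixed seed, and exhibits the vanishing covariance, the dependence (through the squares), and the atom of X + Y at 0.

# Numerical sketch of Example 1 (illustration only): Y = WX with W a
# Rademacher variable independent of X ~ N(0, 1).
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X = rng.standard_normal(n)
W = rng.choice([-1.0, 1.0], size=n)   # Rademacher: P(W = 1) = P(W = -1) = 1/2
Y = W * X

print(np.cov(X, Y)[0, 1])        # ~ 0, consistent with (1)
print(np.cov(X**2, Y**2)[0, 1])  # ~ Var(X^2) = 2 > 0, since Y^2 = X^2: dependence
print(np.mean(X + Y == 0.0))     # ~ 1/2 = P(W = -1), not 0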

Definition 1. If Y ∈ L^1 and

E^X[Y] = E[Y],   (2)

Y is said to be mean independent of X. Similarly, if X ∈ L^1 and

E^Y[X] = E[X],   (3)

X is said to be mean independent of Y.

Example 2 shows that Y can be mean independent of X without X being mean independent of Y.

Example 2. Consider the equi-probable discrete sample space Ω = {−1, 0, 1} and define random variables X and Y on Ω by X(ω) = 1{ω = 0} and Y(ω) = ω. Since σ(X) = {∅, {0}, {−1, 1}, Ω} and E[Y 1_A] is trivially equal to 0 for every A ∈ σ(X), we obtain E^X[Y] = 0 = E[Y]. However, since X(ω) = 1{Y(ω) = 0}, X is σ(Y) measurable, and consequently E^Y[X] = X, whereas E[X] = 1/3. //

It follows from the definition of independence and conditional expectation [Dudley (1989, page 264)] that for X and Y independent, (2) holds if Y ∈ L^1, whereas (3) holds if X ∈ L^1. The asymmetric nature of mean independence established in Example 2 shows that (2) (or, for that matter, (3)) does not imply independence of X and Y. Example 3 extends Example 1 to show that even (2) and (3) combined do not necessarily imply independence of X and Y.

Example 3. Let X, W, and Y be as in Example 1. Since X and W are independent and X ~ N(0, 1), by Corollary 7.1.2 of Chow and Teicher (1988), for every Borel set B,

E[X; Y ∈ B] = (1/2) E[X; X ∈ B] + (1/2) E[X; −X ∈ B] = 0,

the two summands cancelling because X and −X have the same distribution; this implies E^Y[X] = 0 = E[X]. Clearly, by the pull-out property,

E^X[Y] = E^X[WX] = X E^X[W] = X E[W] = 0 = E[Y].

However, as observed in Example 1, X and Y are not independent. //

As observed in the introduction, that mean independence implies zero covariance is well known. Example 4 shows that zero covariance, that is (1), does not necessarily imply mean independence as in (2).

Example 4. Let X be uniformly distributed over the interval (−1, 1) and Y = X². Since E[X] = 0 = E[X³] = E[XY], (1) holds. Since E[Y] = E[X²] = Var(X) = 1/3 while E^X[Y] = E^X[X²] = X², (2) does not hold. //
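Example 4 also admits a quick numerical companion (our sketch, assuming NumPy): the sample covariance of X and Y = X² vanishes, while the conditional mean of Y is visibly non-constant in X.

# Numerical sketch of Example 4 (illustration only).
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=1_000_000)
Y = X**2                              # E[Y | X] = X^2, E[Y] = 1/3

print(np.cov(X, Y)[0, 1])             # ~ E[X^3] = 0, so (1) holds
print(Y.mean())                       # ~ 1/3
# Conditioning on {|X| > 0.9} shows the mean dependence directly:
print(Y[np.abs(X) > 0.9].mean())      # ~ 0.90, far from 1/3, so (2) fails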

Theorem 1 shows that if Y is mean independent of X (as in (2)), then (1) holds; it also characterizes the setup wherein the reverse implication holds.

Theorem 1. Assume that X, Y, XY ∈ L^1. Then (2) implies (1) and

E^X[Y] = α + βX   (4)

for some α, β ∈ ℝ. Conversely, (4) implies

α = E[Y] − β E[X]   (5)

and

β = (Cov(X, Y)/Var(X)) 1{Var(X) > 0};   (6)

further, (6), in conjunction with (1), implies (2).

Clearly, if X is mean independent of Y as in (3), then (1) holds as well. Also, going back to Example 4, we now know why (1) does not imply (2) in that context: since E^X[Y] = X², (4) does not hold, and so, by Theorem 1, (2) cannot hold.

Proof of Theorem 1. By the averaging and pull-out properties,

E[XY] = E[E^X[XY]] = E[X E^X[Y]].   (7)

Also, since X ∈ L^1, Var(X) is well defined, though it may be ∞.

Assume (2) holds. Then (1) follows from (7); also, (4) holds with α = E[Y] and β = 0. Conversely, assume (4) holds. Taking expectations of both sides,

E[Y] = α + β E[X],   (8)

whence (5) follows. Thus it remains to show (6). Note that once we show (6), (1) implies β = 0, which, via (4) and (8), implies (2). Now note that (8) implies

E[X] E[Y] = α E[X] + β (E[X])².   (9)

By the averaging and pull-out properties, E[E^X[XY]] = E[X E^X[Y]]. Since |X E^X[Y]| ≤ |X| E^X[|Y|], we obtain by (4) that X(α + βX) = X E^X[Y] ∈ L^1. Also, since X ∈ L^1, αX ∈ L^1 for every α ∈ ℝ; consequently,

βX² ∈ L^1.   (10)

If X ∉ L^2, that is, if Var(X) = ∞, then the right-hand side of (6) is 0 by definition and the left-hand side of (6) is 0 by (10). If X ∈ L^2, that is, if Var(X) < ∞, by (7) and (4),

E[XY] = E[X(α + βX)] = α E[X] + β E[X²].   (11)

Subtracting (9) from (11), Cov(X, Y) = β Var(X). Multiplying both sides by κ, where κ = 0 if Var(X) = 0 and κ = 1/Var(X) if Var(X) > 0, we obtain

(Cov(X, Y)/Var(X)) 1{Var(X) > 0} = β 1{Var(X) > 0}.

Now note that (6) will follow once we show that β 1{Var(X) = 0} = 0. Since, by the definition of variance, X 1{Var(X) = 0} = E[X] 1{Var(X) = 0}, by (8),

(α + βX) 1{Var(X) = 0} = E[Y] 1{Var(X) = 0};

in other words, an affine function of X with slope β 1{Var(X) = 0} is identically equal to a constant, showing that the slope is equal to 0, thereby completing the proof.
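To see the converse half of Theorem 1 at work outside L^2, consider the following sketch. The construction (Bernoulli X, Pareto multiplier H) is ours, not from the note, and assumes NumPy: E^X[Y] = 3 + 6X is affine, Y has a first but no second moment, XY is integrable because X is bounded, and (5)-(6) predict α = 3, β = 6.

# Sketch of Theorem 1 with Y of infinite variance (illustration only).
import numpy as np

rng = np.random.default_rng(2)
n = 2_000_000
X = rng.integers(0, 2, size=n).astype(float)   # Bernoulli(1/2)
H = 1.0 + rng.pareto(1.5, size=n)              # E[H] = 3, Var(H) = infinity
Y = (1.0 + 2.0 * X) * H                        # E[Y | X] = 3 + 6X, i.e. (4) holds

beta_hat = np.cov(X, Y)[0, 1] / X.var()
alpha_hat = Y.mean() - beta_hat * X.mean()
print(alpha_hat, beta_hat)   # ~ (3, 6); convergence is slow since Y is heavy tailed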

Remark 1. Is there an element of surprise in the conclusion of Theorem 1, which asserts that (4) implies (5) and (6)? The answer is no when X, Y ∈ L^2, since in that case the conclusion follows from the well understood connection (outlined below) between conditional expectation, least squares linear regression, and the operation of projection.

For a fixed real valued measurable function X on (Ω, F, P), let L^2(X) denote the Hilbert space of real valued measurable functions f on (Ω, σ(X), P) such that E[f²] < ∞. Clearly, L^2(X) ⊂ L^2. Let ||·|| denote the L^2 norm on L^2, and by inheritance, on L^2(X). Define

M = {f ∈ L^2 : for some g ∈ L^2(X), f = g outside of a P-null set in F}.

Clearly, M is a subspace of L^2 that contains L^2(X). If {f_n : n ≥ 1} is a sequence in M that converges to f ∈ L^2, then, since every convergent sequence is Cauchy, ||f_n − f_m|| → 0 as m, n → ∞. By the definition of M, there exists a sequence {g_n : n ≥ 1} in L^2(X) such that ||f_n − f_m|| = ||g_n − g_m||, implying that {g_n : n ≥ 1} is a Cauchy sequence in L^2(X). Since L^2(X) is complete, there exists g ∈ L^2(X) such that ||g_n − g|| → 0 as n → ∞. Since ||f_n − g_n|| = 0 for every n ≥ 1, by the triangle inequality in L^2, ||f − g|| = 0, showing that f ∈ M; that is, M is closed in L^2.

Since (E^X[Y])² ≤ E^X[Y²] by the conditional Jensen inequality [Dudley (1989, Theorem 10.2.7)], E^X[Y] ∈ L^2(X) by the averaging property. Let T denote the map from L^2 to L^2(X) that takes Y ∈ L^2 to E^X[Y] ∈ L^2(X). (T depends on X, but we suppress that dependence for notational convenience.) Clearly, T is linear. Since (E^X[Y])² ≤ E^X[Y²], ||TY|| ≤ ||Y|| by the averaging property; thus T is a contraction operator. For any Y ∈ L^2 and g ∈ L^2(X), using the pull-out property, ⟨Y − TY, TY − g⟩ = 0, where ⟨·,·⟩ denotes the inner product in L^2, and consequently

||Y − g||² = ||Y − TY||² + ||TY − g||²,

showing that TY is the unique minimizer of ||Y − g|| as g varies over L^2(X). Recall that M is a closed subspace of L^2; let the orthogonal projection from L^2 to M be denoted by π_M. Then, for any Y ∈ L^2 and f ∈ M,

||Y − f||² = ||Y − π_M Y||² + ||π_M Y − f||²,

showing that π_M Y is the unique minimizer of ||Y − f|| as f varies over M. Since

inf{||Y − g|| : g ∈ L^2(X)} = inf{||Y − f|| : f ∈ M},

π_M Y equals TY outside of a P-null set in F. Note that if the probability space (Ω, σ(X), P) is complete, so that the almost sure limit of a sequence of measurable functions is measurable, then L^2(X) is itself a closed subspace of L^2, and in our identification of the conditional expectation as a projection we can avoid the construction involving M.

Let W denote the two-dimensional linear space spanned by 1 and X, where 1 is the real valued function on (Ω, F, P) that is almost surely equal to 1; since X ∈ L^2, we obtain W ⊂ L^2(X) ⊂ M. Putting aside for the time being the degenerate case where X is a scalar multiple of 1, and applying the Gram-Schmidt orthonormalization process to the basis {1, X} of W, we obtain the orthonormal basis {1, X̃} of W, where

X̃ = (X − ⟨1, X⟩ 1) / ||X − ⟨1, X⟩ 1||.

Let 𝒫 denote the subspace of L^2 consisting of all Y ∈ L^2 such that TY ∈ W, that is, such that (4) holds for Y. If Y ∈ 𝒫, then π_M Y = TY ∈ W ⊂ L^2(X) ⊂ M, and consequently

TY = π_M Y = π_W(π_M Y) = π_W Y = ⟨1, Y⟩ 1 + ⟨X̃, Y⟩ X̃,   (12)

where π_W denotes the orthogonal projection from L^2 to W. In more familiar notation, (12) asserts

E^X[Y] = E[Y] + (Cov(X, Y)/Var(X)) (X − E[X]),

that is, (5) and (6) hold. Note that if X is a scalar multiple of 1, that is, X is a degenerate random variable and Var(X) = 0, then W is simply the span of 1, and we obtain TY = ⟨1, Y⟩ 1, that is, E^X[Y] = E[Y], leading to the conclusion of Theorem 1 for X, Y ∈ L^2. //
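A finite-sample analogue of (12) may help fix ideas. In the sketch below (our addition, assuming NumPy), least squares projection of a sample of Y onto the span of {1, X} reproduces the coefficients E[Y] − βE[X] and Cov(X, Y)/Var(X) of (5) and (6).

# Finite-sample analogue of the projection identity (12) (illustration only).
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
X = rng.standard_normal(n)
Y = 2.0 + 3.0 * X + rng.standard_normal(n)     # E[Y | X] = 2 + 3X, so (4) holds

A = np.column_stack([np.ones(n), X])           # basis {1, X} of W
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)   # orthogonal projection onto W
beta_hat = np.cov(X, Y)[0, 1] / X.var()
alpha_hat = Y.mean() - beta_hat * X.mean()
print(coef)                                    # ~ (2, 3)
print(alpha_hat, beta_hat)                     # same numbers, per (5) and (6)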

Remark 2. Can the conclusion of Theorem 1, when we only have Y ∈ L^1 \ L^2, X ∈ L^1, and XY ∈ L^1, be obtained using the projection operator tools employed in Remark 1? The answer, as far as we can tell, is no.

The domain of the map T can be expanded to L^1, causing the range to expand to L^1(X), the Banach space of real valued measurable functions f on (Ω, σ(X), P) such that E|f| < ∞. The linearity of T is not affected by this expansion of the domain. Since |E^X[Y]| ≤ E^X[|Y|] (again by the conditional Jensen inequality), ||TY||_1 ≤ ||Y||_1 by the averaging property, where ||·||_1 denotes the L^1 norm on L^1, and by inheritance, on L^1(X). Thus, T remains a contraction operator. In fact, as observed by Kallenberg (2002), T on L^2 is uniformly L^1-continuous, and its extension to a linear and continuous map on L^1 is unique up to almost sure equivalence.

The definition of 𝒫 can be extended so that 𝒫 denotes the subspace of L^1 consisting of all Y ∈ L^1 such that TY ∈ W. Since W is a finite-dimensional, hence closed, subspace of L^1, and T is continuous, 𝒫 is a closed subspace of L^1. Let 𝒬 denote the subspace of L^1 consisting of all Y ∈ L^1 such that XY ∈ L^1. If X ∈ L^2, then, by the Cauchy-Schwarz inequality, L^2 ⊂ 𝒬. Theorem 1 asserts that the representation of T as the orthogonal projection onto W, valid on 𝒫 ∩ L^2, extends to hold on 𝒫 ∩ 𝒬. Given that the set inclusion relationship between the closure of 𝒫 ∩ L^2 in L^1 and 𝒫 ∩ 𝒬 is in general indeterminable (if X is bounded, then 𝒬 = L^1 and 𝒫 = 𝒫 ∩ 𝒬, implying that the closure of 𝒫 ∩ L^2 in L^1 is contained in 𝒫 ∩ 𝒬, though the inclusion cannot necessarily be reversed), it seems that the result of Theorem 1, even when X is bounded, cannot be deduced using the technique of operator extension.

What happens if X ∈ L^1 \ L^2? Clearly, W is no longer a subspace of L^2. However, since L^2 is dense in L^1, we can find a sequence {X_n : n ≥ 1} in L^2 such that X_n converges to X in L^1. With W_n denoting the span of 1 and X_n, the orthogonal projection of L^2 onto W_n converges to the orthogonal projection of L^2 onto the span of 1. From the proof of Theorem 1, T on 𝒫 ∩ 𝒬 equals the "orthogonal projection," in a limiting sense, onto the span of 1. //

Remark 3. What does the relaxation of the structural assumption from X, Y ∈ L^2 to Y ∈ L^1 \ L^2, X ∈ L^1, and XY ∈ L^1 entail? One can conceivably argue that the Cauchy-Schwarz inequality remains the primary tool for verifying that XY ∈ L^1 when X and Y are dependent random variables, and, as such, that this relaxation of the assumption is neither insightful nor useful. While that argument may have some merit, we would like to point out that if X is bounded, then obviously X ∈ L^2, and Y ∈ L^1 \ L^2 implies XY ∈ L^1. A classic example of (4) holding for a bounded random variable X is the Bernoulli random variable, since, for any measurable function h(X) of the Bernoulli random variable X, we have h(X) = h(0) + (h(1) − h(0))X, and E^X[Y] is a measurable function of X.

Working with the non-central t-distribution with 2 degrees of freedom, the following example presents (X, Y) such that X ∈ L^1 but is not bounded, Y ∈ L^1 \ L^2, XY ∈ L^1, and (4) holds. Let X ~ N(0, 1) and, given X, let W ~ N(X, 1). Let V ~ χ²_2 be independent of (W, X), and let Y = W/√(V/2). Clearly, X ∈ L^1 but is not bounded. To verify that Y ∈ L^1 \ L^2, we first obtain the marginal distribution of W. Using the form of the conditional density of W given X = x and the form of the marginal density of X, we obtain that the joint density of (W, X) is given by

f(w, x) = (1/(2π)) exp(−(w² − 2wx + 2x²)/2);

since x ↦ exp(−(x − w/2)²)/√π represents the density of the N(w/2, 1/2) distribution, completing the square in x we obtain the marginal density of W to be

f_W(w) = ∫ f(w, x) dx = (exp(−w²/4)/(2√π)) ∫ (exp(−(x − w/2)²)/√π) dx = exp(−w²/4)/(2√π),

showing that W ~ N(0, 2). Since V = 2U, where U is a Γ(1) random variable [Fabian and Hannan (1985, Definition 3.3.6)], by Theorem 3.3.9 of Fabian and Hannan (1985),

E[1/√(V/2)] = E[U^(−1/2)] = Γ(1/2)/Γ(1) = √π;   (13)

since W and V are independent, Y ∈ L^1. However, E[2/V] = E[U^(−1)] = ∞, showing that Y ∉ L^2. To show that XY ∈ L^1, by the independence of (W, X) and V along with (13), it suffices to show that XW ∈ L^1, which follows readily from the Cauchy-Schwarz inequality. Finally, by Corollary 7.1.2 of Chow and Teicher (1988) and (13), E^(W,X)[Y] = √π W, implying, by the chain rule, E^X[Y] = √π X, that is, (4) holds. //
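The closing example can be checked by simulation. The sketch below is our addition (assuming NumPy): even though Y has no second moment, the least squares slope Cov(X, Y)/Var(X) settles near √π, matching E^X[Y] = √π X.

# Simulation of the non-central t example (illustration only).
import numpy as np

rng = np.random.default_rng(4)
n = 2_000_000
X = rng.standard_normal(n)
W = X + rng.standard_normal(n)       # given X, W ~ N(X, 1)
V = rng.chisquare(2, size=n)         # chi-squared, 2 df, independent of (W, X)
Y = W / np.sqrt(V / 2.0)             # Y has a first but no second moment

slope = np.cov(X, Y)[0, 1] / X.var()
print(slope, np.sqrt(np.pi))         # both ~ 1.7725; convergence is slow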

References

Chow, Y. S. and Teicher, H. (1988). Probability Theory: Independence, Interchangeability, Martingales. 2nd Ed., Springer-Verlag, New York, NY.

Dudley, R. M. (1989). Real Analysis and Probability. 1st Ed., Wadsworth & Brooks/Cole, Pacific Grove, CA.

Fabian, V. and Hannan, J. (1985). Introduction to Probability and Mathematical Statistics. Wiley, New York, NY.

Kallenberg, O. (2002). Foundations of Modern Probability. 2nd Ed., Springer-Verlag, New York, NY.

RAJESHWARI MAJUMDAR
rajeshwari.majumdar@uconn.edu
PO Box 47
Coventry, CT 06238