A Flexible Nonparametric Test for Conditional Independence


A Flexible Nonparametric Test for Conditional Independence

Meng Huang (Freddie Mac), Yixiao Sun and Halbert White (UC San Diego)

August 5, 2015

Abstract

This paper proposes a nonparametric test for conditional independence that is easy to implement, yet powerful in the sense that it is consistent and achieves $n^{-1/2}$ local power. The test statistic is based on an estimator of the topological distance between restricted and unrestricted probability measures corresponding to conditional independence or its absence. The distance is evaluated using a family of Generically Comprehensively Revealing (GCR) functions, such as the exponential or logistic functions, which are indexed by nuisance parameters. The use of GCR functions makes the test able to detect any deviation from the null. We use a kernel smoothing method when estimating the distance. An integrated conditional moment (ICM) test statistic based on these estimates is obtained by integrating out the nuisance parameters. We simulate the critical values using a conditional simulation approach. Monte Carlo experiments show that the test performs well in finite samples. As an application, we test an implication of the key assumption of unconfoundedness in the context of estimating the returns to schooling.

We thank Graham Elliott, Dimitris Politis, Patrick Fitzsimmons, Jin Seo Cho, Liangjun Su, James Hamilton, Andres Santos, Brendan Beare, and seminar participants at the University of California San Diego, the Hong Kong University of Science and Technology, the Peking University Guanghua School of Management, and Bates White, LLC for helpful comments and suggestions. We are equally grateful for the constructive comments from Yuichi Kitamura, the co-editor, and two anonymous referees. Special thanks to Liangjun Su for sharing computer programs. Address correspondence to Meng Huang, Freddie Mac, 55 Park Run Dr., McLean, VA; nkuangmeng@gmail.com; or to Yixiao Sun, Department of Economics 0508, University of California, San Diego, La Jolla, CA 92093; yisun@ucsd.edu.

Running head: Nonparametric Test for Conditional Independence

1 Introduction

In this paper, we propose a flexible nonparametric test for conditional independence. Let $X$, $Y$, and $Z$ be three random vectors. The null hypothesis we want to test is that $Y$ is independent of $X$ given $Z$, denoted by $Y \perp X \mid Z$. Intuitively, this means that, given the information in $Z$, $X$ cannot provide additional information useful in predicting $Y$. Dawid (1979) showed that some simple heuristic properties of conditional independence can form a conceptual framework for many important topics in statistical inference: sufficiency and ancillarity, parameter identification, causal inference, prediction sufficiency, data selection mechanisms, invariant statistical models, and a subjectivist approach to model-building.

An important application of conditional independence testing in economics is to test a key assumption identifying causal effects. Suppose we are interested in estimating the effect of $X$ (e.g., schooling) on $Y$ (e.g., income), and that $X$ and $Y$ are related by the equation $Y = \alpha + \beta X + U$, where $U$ (e.g., ability) is an unobserved cause of $Y$ (income) and $\alpha$ and $\beta$ are unknown coefficients, with $\beta$ representing the effect of $X$ on $Y$. (We write a linear structural equation here merely for concreteness.) Since $X$ is typically not randomly assigned and is correlated with $U$ (e.g., unobserved ability will affect both schooling and income), OLS will generally fail to consistently estimate $\beta$. Nevertheless, if, as in Griliches and Mason (1972) and Griliches (1977), we can find a set of covariates $Z$ (e.g., proxies for ability, such as AFQT scores) such that

$$U \perp X \mid Z, \qquad (1)$$

we can estimate $\beta$ consistently by various methods: covariate adjustment, matching, methods using the propensity score such as weighting and blocking, or combinations of these approaches. The assumption in (1) is a key assumption for identifying $\beta$. It is called a conditional exogeneity assumption by White and Chalak (2008). It enforces the ignorability or unconfoundedness condition, also known as selection on observables (Barnow, Cain, and Goldberger, 1980).

Note that the conditional independence assumption in (1) cannot be directly tested, since $U$ is unobservable. But if there are other observable covariates $V$ satisfying certain conditions (see White and Chalak), we have

$$U \perp X \mid Z \ \text{ implies } \ V \perp X \mid Z,$$

so we can test the assumption in (1) by testing its implication, $V \perp X \mid Z$. Section 6 of this paper applies this test in the context of a nonparametric study of returns to schooling.

In the literature, there are many tests for conditional independence when the variables are categorical. However, in economic applications it is common to condition on continuous variables, and there are only a few nonparametric tests for the continuous case. Previous work on testing conditional independence for continuous random variables includes Linton and Gozalo (1997, "LG"), Fernandes and Flores (1999, "FF"), and Delgado and González-Manteiga (2001, "DG"). Su and White ("SW") have several papers (2007, 2008, among others) addressing this question. Although SW's tests are consistent against any deviation from the null, they are only able to detect local alternatives converging to the null at a rate slower than $n^{-1/2}$ and hence suffer from the curse of dimensionality.

We will compare our test with the LG, DG and SW tests in our simulation study. Recently, Song (2009) has proposed a distribution-free conditional independence test of two continuous random variables given a parametric single index that achieves the local $n^{-1/2}$ rate. Specifically, Song (2009) tests the hypothesis $Y \perp X \mid \lambda_\theta(Z)$, where $\lambda_\theta(\cdot)$ is a scalar-valued function known up to a finite-dimensional parameter $\theta$, which must be estimated.

A main contribution here is that our proposed test also achieves $n^{-1/2}$ local power, despite its fully nonparametric nature. In contrast to Song (2009), the conditioning variables can be multidimensional. The test is motivated by a series of papers on consistent specification testing by Bierens (1982, 1990), Bierens and Ploberger (1997), and Stinchcombe and White (1998, "StW"), among others. Whereas Bierens (1982, 1990) and Bierens and Ploberger (1997) construct integrated conditional moment (ICM) tests that essentially compare a restricted parametric model with an unrestricted regression model, the test in this paper follows a suggestion of StW, which is based on estimates of the topological distance between unrestricted and restricted probability measures, corresponding to conditional independence or its absence. This distance is measured indirectly by a family of moments, which are the differences of the expectations under the null and under the alternative for a set of test functions. The chosen test functions make use of Generically Comprehensively Revealing (GCR) functions, such as the logistic or normal cumulative distribution functions (CDFs), and are indexed by a continuous nuisance parameter vector $\theta$. Under the null, all moments are zero. Under the alternative, the moments are nonzero for essentially all choices of $\theta$. This is in contrast with DG (2001), which employs an indicator testing function that is not generically comprehensively revealing.

We estimate these moments by their sample analogs, using kernel smoothing. An ICM test statistic based on these is obtained by integrating out the nuisance parameters. Its limiting null distribution is a functional of a mean zero Gaussian process. We simulate critical values using a conditional simulation approach suggested by Hansen (1996) in a different setting.

Our GCR approach requires bounded random variables. When any random variable is not bounded, we first standardize it on the basis of estimated location and scale parameters and then apply a bounded and invertible transformation to the standardized data. The location and scale parameters can be the mean and standard deviation or other more robust measures such as the median and interquartile range, respectively.

The plan of the paper is as follows. In Section 2, we specify a family of moment conditions which is (essentially) equivalent to the null hypothesis of conditional independence and forms a basis for our test. In Section 3, we establish stochastic approximations of the empirical moment conditions uniformly over the nuisance parameters. We derive the finite-dimensional weak convergence of the empirical moment process. We also provide a bandwidth choice for practical use: a simple plug-in estimator of the MSE-optimal bandwidth. In Section 4, we formally introduce and analyze our ICM test statistic. In Section 5, we report some Monte Carlo results, examining the size and power properties of our test and comparing its performance with that of a variety of other tests in the literature. In Section 6, we study the returns to schooling, using the proposed statistic to test an implication of the key assumption of unconfoundedness. The last section concludes. The appendix contains the proofs of the main results and shows that the estimation errors in the location and scale parameters have no impact on our asymptotic theory.

2 The Null Hypothesis and the Testing Approach

2.1 The Null Hypothesis

Let $X$, $Y$, and $Z$ be three random vectors, with dimensions $d_X$, $d_Y$, and $d_Z$, respectively. Denote $W = (X', Y', Z')' \in \mathbb{R}^d$ with $d = d_X + d_Y + d_Z$. Given an IID sample $\{X_i, Y_i, Z_i\}_{i=1}^{n}$, we want to test the null that $Y$ is independent of $X$ conditional on $Z$, i.e.,

$$H_0: Y \perp X \mid Z, \qquad (2)$$

against the alternative that $Y$ and $X$ are dependent conditional on $Z$, i.e., $H_a: Y \not\perp X \mid Z$.

Let $F_{Y|XZ}(y \mid x, z)$ be the conditional distribution function of $Y$ given $(X, Z) = (x, z)$ and $F_{Y|Z}(y \mid z)$ be the conditional distribution function of $Y$ given $Z = z$. Then we can express the null as

$$F_{Y|XZ}(y \mid x, z) = F_{Y|Z}(y \mid z). \qquad (3)$$

The following three expressions are equivalent to one another and to (3):

$$F_{X|YZ}(x \mid y, z) = F_{X|Z}(x \mid z), \qquad (4)$$
$$F_{XY|Z}(x, y \mid z) = F_{X|Z}(x \mid z)\, F_{Y|Z}(y \mid z), \qquad (5)$$
$$F_{XYZ}(x, y, z)\, f_Z(z) = F_{XZ}(x, z)\, F_{YZ}(y, z), \qquad (6)$$

where we have used the standard notations for distribution functions.

The approach adopted in this paper is inspired by a series of papers on consistent specification testing: Bierens (1982, 1990), Bierens and Ploberger (1997), and StW, among others. The tests in those papers are based on an infinite number of moment conditions indexed by nuisance parameters. Consider, as an example, the conditional mean function $g(x) = E(Y \mid X = x)$. Bierens (1990) tests the hypothesis that the parametric functional form $f(x, \beta)$ is correctly specified in the sense that $g(x) = f(x, \beta_0)$ for some $\beta_0$. The test statistic is based on an estimator of a family of moments $E[(Y - f(X, \beta_0))\, e^{X'\theta}]$ indexed by a nuisance parameter vector $\theta$. Under the null hypothesis of correct specification, these moments are zero for all $\theta$. A lemma in Bierens (1990) shows that the converse essentially holds, due to the properties of the exponential function, making the test capable of detecting all deviations from the null. A similar idea is used in Bierens and Wang (2012) for testing a parametric specification of the conditional distribution function.

StW find that a broader class of functions has this property. They extend Bierens's result by replacing the exponential function in the moment conditions with any GCR function, and by extending the probability measures considered in the Bierens (1990) approach to signed measures. As stated in StW, GCR functions include non-polynomial real analytic functions, e.g., the exponential function, the logistic CDF, sine, and cosine, and also some nonanalytic functions like the normal CDF or its density. Further, they point out that such specification tests are based on estimates of topological distances between a restricted model and an unrestricted model. Following this idea, we can construct a test for conditional independence based on estimates of a topological distance between unrestricted and restricted probability measures corresponding to conditional independence or its absence.

To define the GCR property formally, let $C(F)$ be the set of continuous functions on a compact set $F \subset \mathbb{R}^d$, and let $\mathrm{sp}[H_\varphi]$ be the linear span of a collection of functions $H_\varphi$. We write $\tilde w := (1, w')'$. The definition below is the same as Definition 3.6 in StW.

Definition 1 (StW, Definition 3.6) We say that $H_\varphi = \{H: \mathbb{R}^d \to \mathbb{R} \mid H(w) = \varphi(\tilde w'\theta),\ \theta \in \Theta \subset \mathbb{R}^{1+d}\}$ is generically comprehensively revealing if for all $\Theta$ with non-empty interior, the uniform closure of $\mathrm{sp}[H_\varphi]$ contains $C(F)$ for every compact set $F \subset \mathbb{R}^d$.

Intuitively, GCR functions are a class of functions indexed by $\theta$ whose span comes arbitrarily close to any continuous function, regardless of the choice of $\Theta$, as long as it has nonempty interior. When there is no confusion, we simply call $\varphi$ GCR if the generated $H_\varphi$ is GCR.

We now establish an equivalent hypothesis in the form of a family of moment conditions following StW. Let $P$ be the joint distribution of the random vector $W$, and let $Q$ be the joint distribution of $W$ with $Y \perp X \mid Z$. Thus, $P$ is an unrestricted probability measure, whereas $Q$ is restricted. To be specific, $P$ and $Q$ are defined such that for any event $A$,

$$P(A) = \int_A dF_{XYZ}(x, y, z) = \int_A dF_{XY|Z}(x, y \mid z)\, dF_Z(z) \qquad (7)$$

and

$$Q(A) = \int_A dF_{X|Z}(x \mid z)\, dF_{Y|Z}(y \mid z)\, dF_Z(z). \qquad (8)$$

Note that the measure $P$ will be the same as the measure $Q$ if and only if the null is true:

$$P(A) = \int_A dF_{XY|Z}(x, y \mid z)\, dF_Z(z) \overset{H_0}{=} \int_A dF_{X|Z}(x \mid z)\, dF_{Y|Z}(y \mid z)\, dF_Z(z) = Q(A)$$

for all Borel sets $A$. Testing the null hypothesis is thus equivalent to testing whether there is any deviation of $P$ from $Q$. It should be pointed out that the marginal distribution of $Z$ is the same under $P$ and $Q$ regardless of whether the null is true or not.

Let $E_P$ and $E_Q$ be the expectation operators with respect to the measure $P$ and the measure $Q$. Define

$$\Lambda_\varphi(\theta) \equiv E_P\big[\varphi(\tilde W_i'\theta)\big] - E_Q\big[\varphi(\tilde W_i'\theta)\big],$$

where $\theta = (\theta_0, \theta_1', \theta_2', \theta_3')' \in \Theta \subset \mathbb{R}^{1+d}$ is a vector of nuisance parameters, $\tilde W = (1, W')'$, and $\varphi$ is such that the indicated expectations exist for all $\theta$. Under the null hypothesis, $\Lambda_\varphi(\theta)$ is obviously zero for any choice of $\theta$ and any choice of $\varphi$, including GCR functions.

To construct a powerful test, we want $\Lambda_\varphi(\theta)$ to be nonzero under the alternative. If $\Lambda_\varphi(\theta^*)$ is not zero under some alternative, we say that $\varphi$ can detect that particular alternative for the choice $\theta = \theta^*$. An arbitrary function $\varphi$ may fail to detect some alternatives for some choices of $\theta$. Nevertheless, according to StW, if $W$ is a bounded random vector, the properties of GCR functions imply that they can detect all possible alternatives for essentially all $\theta \in \Theta \subset \mathbb{R}^{1+d}$ with $\Theta$ having non-empty interior. "Essentially all" means that the set of bad $\theta$'s, i.e., the set $\{\theta: \Lambda_\varphi(\theta) = 0 \text{ and } Y \not\perp X \mid Z\}$, has Lebesgue measure zero and is not dense in $\Theta$. Given that any deviation of $P$ from $Q$ can be detected by essentially any choice of $\theta$, testing $H_0: Y \perp X \mid Z$ is equivalent to testing

$$H_0': \Lambda_\varphi(\theta) = 0 \ \text{ for essentially all } \theta \in \Theta \qquad (9)$$

for a GCR function $\varphi$ and a set $\Theta$ with non-empty interior.

The alternative is $H_a': H_0'$ is false.

A straightforward testing approach is to estimate $\Lambda_\varphi(\theta)$ and to see how far the estimate is from zero. However, there are two technical issues. First, the result of StW requires $W$ to be bounded. To achieve the boundedness, we can replace each element $V$ of $W$ by $\phi_V(V)$, where $\phi_V$ is a bounded one-to-one mapping with Borel measurable inverse. Define $\phi_Y(Y) = (\phi_{Y_1}(Y_1), \ldots, \phi_{Y_{d_Y}}(Y_{d_Y}))'$ and define $\phi_X(X)$ and $\phi_Z(Z)$ similarly. Then $Y \perp X \mid Z$ is equivalent to $\phi_Y(Y) \perp \phi_X(X) \mid \phi_Z(Z)$. The equivalence holds because the sigma fields are not affected by the transformation. So it is innocuous to assume that $W$ has a bounded support.

In practice, we recommend choosing a bounded set, say $[a, b]^d$, to closely match the support of the GCR function we use. Depending on whether the random variables have bounded supports or not, we can achieve this using a different transformation. As an example, suppose that the support of $V$ is a bounded interval, say $[v_{\min}, v_{\max}]$ for some known finite constants $v_{\min}$ and $v_{\max}$. Then we can take

$$\phi_V(V) = a + (b - a)\,\frac{V - v_{\min}}{v_{\max} - v_{\min}}, \qquad (10)$$

which obviously meets our requirement. If the end points of the support are not known, we can estimate them by $\hat v_{\min} = \min_{i=1,\ldots,n} V_i$ and $\hat v_{\max} = \max_{i=1,\ldots,n} V_i$ and plug the estimates into (10). Under some mild conditions, $\hat v_{\min}$ and $\hat v_{\max}$ converge to the true endpoints at the fast rate of $1/n$. As a result, we can show that the estimation uncertainty has no effect on our asymptotic results.

When the random variable $V$ has an unbounded support, we first standardize it and then take

$$\phi_V(V) = a + (b - a)\left[\frac{\arctan\big((V - \mu_v)/\sigma_v\big)}{\pi} + 0.5\right], \qquad (11)$$

where $\mu_v$ and $\sigma_v$ are location and scale parameters. For example, $\mu_v$ and $\sigma_v$ can be the mean and standard deviation of $V$. By construction, $\phi_V(V) \in [a, b]$. We can also use other bounded functions such as $\phi_V(V) = a + (b - a)F((V - \mu_v)/\sigma_v)$ for a CDF $F$. The standardization ensures that $\phi_V(V)$ does not reside in a small subset of $[a, b]$. See Bierens and Wang (2012) for more discussion. When $\mu_v$ and $\sigma_v$ are not known, we can estimate them by $\hat\mu_v$ and $\hat\sigma_v$ respectively and plug them into (11). In Section 8.1, we make the transformation explicit and show that under some mild conditions, including the $\sqrt{n}$-consistency of $\hat\mu_v$ and $\hat\sigma_v$, the estimation errors in $\hat\mu_v$ and $\hat\sigma_v$ have no impact on the asymptotic properties of our proposed test. Here, for notational simplicity, we leave this transformation implicit and assume that $P(W \in [a, b]^d) = 1$. Without loss of generality, we let $a = 0$ and $b = 1$ for our theoretical development.
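For illustration, the following is a minimal sketch of the transformations in (10) and (11), assuming the sample mean and standard deviation are used as the location and scale estimates (the median and interquartile range are equally valid choices); the function name is illustrative, not part of the paper.

```python
import numpy as np

def to_bounded(v, a=0.0, b=1.0, bounded_support=False):
    """Map a univariate sample v into [a, b]."""
    v = np.asarray(v, dtype=float)
    if bounded_support:
        # Known or estimated endpoints: linear map, as in equation (10).
        v_min, v_max = v.min(), v.max()
        return a + (b - a) * (v - v_min) / (v_max - v_min)
    # Unbounded support: standardize, then apply the arctan map of equation (11).
    mu, sigma = v.mean(), v.std(ddof=1)
    return a + (b - a) * (np.arctan((v - mu) / sigma) / np.pi + 0.5)
```

Applying `to_bounded` componentwise to $X$, $Y$, and $Z$ produces data supported on $[a, b]^d$, after which the test can be computed exactly as described below.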

The second issue is related to the nonparametric estimation of $\Lambda_\varphi(\theta)$: it involves a nonparametric estimator $\hat f_Z$ of the density $f_Z$ in the denominator of the test statistic, making the analysis of limiting distributions awkward. To avoid this technical issue, we compute the expectations of $\varphi f_Z$ rather than those of $\varphi$, leading to a new distance metric between $P$ and $Q$:

$$\Lambda_{\varphi f}(\theta) = E_P\big[\varphi(\tilde W_i'\theta) f_Z(Z_i)\big] - E_Q\big[\varphi(\tilde W_i'\theta) f_Z(Z_i)\big].$$

Using the change-of-measure technique, we have

$$\Lambda_{\varphi f}(\theta) = C\left\{E_{P^*}\big[\varphi(\tilde W_i'\theta)\big] - E_{Q^*}\big[\varphi(\tilde W_i'\theta)\big]\right\},$$

where $P^*$ and $Q^*$ are probability measures defined according to

$$P^*(A) = \int_A f_Z(z)\, dF_{XY|Z}(x, y \mid z)\, dF_Z(z)/C \quad\text{and}\quad Q^*(A) = \int_A f_Z(z)\, dF_{X|Z}(x \mid z)\, dF_{Y|Z}(y \mid z)\, dF_Z(z)/C \qquad (12)$$

with $C = \int f_Z^2(z)\, dz$ being the normalizing constant. Under the null $H_0: Y \perp X \mid Z$, $P^*$ and $Q^*$ are the same measure, and so $\Lambda_{\varphi f}(\theta) = 0$ for all $\theta$. Under the alternative $H_a: Y \not\perp X \mid Z$, $P^*$ and $Q^*$ are different measures. By definition, if $\varphi$ is GCR, then its revealing property holds for any probability measure (see Definition 3.1 of StW). So under the alternative, we have $\Lambda_{\varphi f}(\theta) \neq 0$ for essentially all $\theta$. The behaviors of $\Lambda_{\varphi f}(\theta)$ under $H_0$ and $H_a$ imply that we can employ $\Lambda_{\varphi f}(\theta)$ in place of $\Lambda_\varphi(\theta)$ to perform our test.

To sum up, when $\varphi$ is a GCR function, $W$ has bounded support, $\Theta$ has non-empty interior, and $\int f_Z^2(z)\, dz < \infty$, a null hypothesis equivalent to conditional independence is

$$H_0': \Lambda_{\varphi f}(\theta) = 0 \ \text{ for essentially all } \theta \in \Theta.$$

That is, the null hypothesis of conditional independence is equivalent to a family of moment conditions indexed by $\theta$. For notational simplicity, we drop the subscript and write $\Lambda(\theta) := \Lambda_{\varphi f}(\theta)$ hereafter.

2.2 Heuristics for Rates

When the probability density functions exist, conditional independence is equivalent to any of the following:

$$f_{Y|XZ}(y \mid x, z) = f_{Y|Z}(y \mid z), \quad f_{X|YZ}(x \mid y, z) = f_{X|Z}(x \mid z), \quad f_{XY|Z}(x, y \mid z) = f_{X|Z}(x \mid z) f_{Y|Z}(y \mid z), \quad f_{XYZ}(x, y, z) f_Z(z) = f_{XZ}(x, z) f_{YZ}(y, z), \qquad (13)$$

where the notation for density functions is self-explanatory. One way to test conditional independence is to compare the densities in a given equation to see if the equality holds. For example, Su and White's (2008) test essentially compares $f_{XYZ} f_Z$ with $f_{XZ} f_{YZ}$. To do that, they estimate $f_{XYZ}$, $f_Z$, $f_{XZ}$, and $f_{YZ}$ nonparametrically, so their test has power against local alternatives only at a rate determined by the slowest of the four nonparametric density estimators, i.e., the rate for $\hat f_{XYZ}$. This rate is slower than $n^{-1/2}$ and hence reflects the curse of dimensionality. The dimension here is $d = d_X + d_Y + d_Z$, which is at least three and could potentially be larger.

To achieve the rate $n^{-1/2}$, we do not compare the density functions directly. Instead, our family of moment conditions indirectly measures the distance between $f_{XYZ} f_Z$ and $f_{XZ} f_{YZ}$, so that for each given $\theta$, the test statistic is based on an estimator of an average that can achieve an $n^{-1/2}$ rate, just as a semiparametric estimator would. To better understand the moment conditions of the equivalent null, we write

$$\Lambda(\theta) = \int \varphi(\tilde w'\theta)\, f_Z(z)\, f_{XYZ}(x, y, z)\, dx\, dy\, dz - \int \varphi(\tilde w'\theta)\, f_{YZ}(y, z)\, f_{XZ}(x, z)\, dx\, dy\, dz.$$

Instead of comparing $f_{XYZ} f_Z$ with $f_{YZ} f_{XZ}$, we now compare their integral transforms. Before the transformation, $f_{XYZ} f_Z$ and $f_{YZ} f_{XZ}$ are functions of $(x, y, z)$, the data points, and those functions can only be estimated at a nonparametric rate slower than $n^{-1/2}$. But their integral transforms are functions of $\theta$. For each $\theta$, the transformation is an average of the data, so semiparametric techniques can be used here to get an $n^{-1/2}$ rate. Essentially, we compare two functions by comparing their weighted averages. The two comparisons are equivalent because of the properties of the chosen test functions. That is, if we choose GCR functions for our test functions, defined on a compact index space $\Theta$ with non-empty interior, and we do not detect any difference between the $P^*$ and $Q^*$ transforms at essentially any point $\theta$, then the transforms must agree, and as a consequence $P^*$ and $Q^*$ must agree. We gain robustness by integrating over many points $\theta$.

2.3 Empirical Moment Conditions

With some abuse of notation, we write $\varphi(\theta_0 + x'\theta_1 + y'\theta_2 + z'\theta_3) \equiv \varphi(x, y, z; \theta)$. Define

$$g_X(x, z; \theta) = E[\varphi(x, Y, z; \theta) \mid Z = z] = \int \varphi(x, y, z; \theta)\, f_{Y|Z}(y \mid z)\, dy. \qquad (14)$$

Then the moment conditions can be rewritten as

$$\Lambda(\theta) = E[\varphi(X, Y, Z; \theta) f_Z(Z)] - E[g_X(X, Z; \theta) f_Z(Z)].$$

The first term of $\Lambda(\theta)$ is a mean of $\varphi f_Z$, where $\varphi$ is known and $f_Z$ can be estimated by a kernel smoothing method. The second term is a mean of $g_X(\cdot) f_Z(\cdot)$, where the function $g_X(x, z; \theta)$ is a conditional expectation that can be estimated by a Nadaraya-Watson estimator:

$$\hat g_X(x, z; \theta) = \frac{\sum_{j=1}^{n} \varphi(x, Y_j, z; \theta)\, K_h(Z_j - z)}{\sum_{j=1}^{n} K_h(Z_j - z)}.$$

Thus we can estimate $\Lambda(\theta)$ by

$$\hat\Lambda_{n,h}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \varphi(\tilde W_i'\theta)\, \hat f_Z(Z_i) - \frac{1}{n}\sum_{i=1}^{n} \hat g_X(X_i, Z_i; \theta)\, \hat f_Z(Z_i) = \frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j=1, j\neq i}^{n} \big[\varphi(\tilde W_i'\theta) - \varphi(\tilde W_{i,j}'\theta)\big] K_h(Z_i - Z_j), \qquad (15)$$

where $\hat f_Z(Z_i) = (n-1)^{-1}\sum_{j\neq i} K_h(Z_j - Z_i)$ is a leave-one-out kernel density estimator, $\tilde W_{i,j}'\theta = \theta_0 + X_i'\theta_1 + Y_j'\theta_2 + Z_i'\theta_3$, and $K_h(u)$ is a multivariate kernel function. In this paper, we follow the standard practice and use a product kernel of the form

$$K_h(u) = \frac{1}{h^{d_u}} K\!\left(\frac{u_1}{h}, \ldots, \frac{u_{d_u}}{h}\right) \quad\text{with}\quad K(u_1, \ldots, u_{d_u}) = \prod_{\ell=1}^{d_u} k(u_\ell),$$

where $d_u$ is the dimension of $u$ and $h = h_n$ is the bandwidth that depends on $n$.
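To make the estimator concrete, here is a minimal sketch of $\hat\Lambda_{n,h}(\theta)$ in (15), assuming the standard normal PDF as the GCR function $\varphi$, data already transformed to a bounded range, and, purely for simplicity, a second-order Gaussian product kernel (the simulations in Section 5 use a sixth-order kernel). All names are illustrative, and $X$, $Y$, $Z$ are arrays of shape $(n, d_X)$, $(n, d_Y)$, $(n, d_Z)$.

```python
import numpy as np
from scipy.stats import norm

def lambda_hat(X, Y, Z, theta0, t_x, t_y, t_z, h):
    """U-statistic estimator Lambda_hat_{n,h}(theta) of equation (15)."""
    n, d_z = Z.shape
    base = theta0 + X @ t_x + Z @ t_z              # theta_0 + X_i'theta_1 + Z_i'theta_3
    s_own = base + Y @ t_y                         # index using the own Y_i
    s_cross = base[:, None] + (Y @ t_y)[None, :]   # index with Y_j swapped in (W~_{i,j})
    # Product Gaussian kernel K_h(Z_i - Z_j) = h^{-d_Z} prod_l k((Z_il - Z_jl)/h).
    U = (Z[:, None, :] - Z[None, :, :]) / h
    K = norm.pdf(U).prod(axis=2) / h ** d_z
    np.fill_diagonal(K, 0.0)                       # exclude i == j terms
    diff = norm.pdf(s_own)[:, None] - norm.pdf(s_cross)   # phi(W~_i'theta) - phi(W~_{i,j}'theta)
    return (diff * K).sum() / (n * (n - 1))
```

The double loop implicit in (15) is vectorized into $n \times n$ arrays, so memory grows quadratically in $n$; for large samples one would instead accumulate the sum over $i$ in a loop.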

$\hat\Lambda_{n,h}(\theta)$ is an empirical version of $\Lambda(\theta)$. For each $\theta$, $\hat\Lambda_{n,h}(\theta)$ is a second-order U-statistic. When $\hat\Lambda_{n,h}(\theta)$ is regarded as a process indexed by $\theta$, $\hat\Lambda_{n,h}(\cdot)$ is a U-process. Note that $[\varphi(\tilde W_i'\theta) - \varphi(\tilde W_{i,j}'\theta)] K_h(Z_i - Z_j)$ is not symmetric in $i$ and $j$. To achieve the symmetry so that the theory of U-statistics and U-processes can be applied, we rewrite $\hat\Lambda_{n,h}(\theta)$ as

$$\hat\Lambda_{n,h}(\theta) = \binom{n}{2}^{-1}\sum_{i<j} \psi_{h,\theta}(W_i, W_j; \theta), \qquad (16)$$

where

$$\psi_{h,\theta}(W_i, W_j; \theta) = \frac{1}{2}\Big\{\big[\varphi(\tilde W_i'\theta) - \varphi(\tilde W_{i,j}'\theta)\big] K_h(Z_i - Z_j) + \big[\varphi(\tilde W_j'\theta) - \varphi(\tilde W_{j,i}'\theta)\big] K_h(Z_j - Z_i)\Big\} = \psi_{h,\theta}(W_j, W_i; \theta).$$

Our test statistic is related to what DG (2001, Section 5.1) propose, but no formal proof is given there. DG formulate the null hypothesis as

$$H_0: L(\tau_1, \tau_2, \tau_3) = 0 \ \text{ for all } (\tau_1, \tau_2, \tau_3) \text{ in the support of } (X', Y', Z')', \qquad (17)$$

where

$$L(\tau_1, \tau_2, \tau_3) = E\Big\{\big[\mathbf{1}_{\tau_2}(Y) - E\big(\mathbf{1}_{\tau_2}(Y) \mid Z\big)\big]\, \mathbf{1}_{\tau_1}(X)\, \mathbf{1}_{\tau_3}(Z)\, f_Z(Z)\Big\}$$

and $\mathbf{1}_v(V) = \mathbf{1}\{V \leq v\}$ is the indicator function. The DG statistic is based on the following estimator of $L(\tau_1, \tau_2, \tau_3)$:

$$\hat L_{n,h}(\tau_1, \tau_2, \tau_3) = \frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j=1, j\neq i}^{n} \big[\mathbf{1}_{\tau_2}(Y_i) - \mathbf{1}_{\tau_2}(Y_j)\big]\, \mathbf{1}_{\tau_1}(X_i)\, \mathbf{1}_{\tau_3}(Z_i)\, K_h(Z_j - Z_i).$$

Comparing this with $\hat\Lambda_{n,h}(\theta)$ given in (15), we can see that $\hat L_{n,h}(\cdot)$ takes the same form as $\hat\Lambda_{n,h}(\cdot)$. The difference is that we use a GCR function $\varphi(x, y, z; \theta)$ while DG use the indicator function $\mathbf{1}_{\tau_1}(x)\, \mathbf{1}_{\tau_2}(y)\, \mathbf{1}_{\tau_3}(z)$. This has both theoretical and practical implications.

First, from a theoretical point of view, while the indicator function is comprehensively revealing, it is not generically comprehensively revealing. An advantage of using a GCR function is that the alternative hypothesis can be revealed by essentially all $\theta \in \Theta$. This property does not hold for the indicator function; that is, there may exist a region of $(\tau_1, \tau_2, \tau_3)$ with nonempty interior such that the moment condition in (17) holds but $Y \not\perp X \mid Z$.

Second, the GCR approach requires bounded random variables while the DG approach does not. So in our setting, the boundary smoothing bias cannot be avoided and has to be dealt with explicitly and rigorously. DG assume unbounded supports and so they do not have to deal with the boundary problem.

Third, there can be a practical problem when computing some functionals of $\hat L_{n,h}(\tau_1, \tau_2, \tau_3)$. For instance, the Cramér-von Mises statistic in DG (2001) takes the form

$$C_n = \sum_{i=1}^{n} \big[\hat L_{n,h}(X_i, Y_i, Z_i)\big]^2,$$

where by definition

$$\hat L_{n,h}(X_i, Y_i, Z_i) = \frac{1}{n(n-1)}\sum_{k=1}^{n}\sum_{j=1, j\neq k}^{n} \big[\mathbf{1}_{Y_i}(Y_k) - \mathbf{1}_{Y_i}(Y_j)\big]\, \mathbf{1}_{X_i}(X_k)\, \mathbf{1}_{Z_i}(Z_k)\, K_h(Z_j - Z_k).$$

The DG test rejects the null when $C_n$ is larger than an asymptotically valid critical value. When the dimensions of $X$ and $Z$ are large, $X_k \leq X_i$ and $Z_k \leq Z_i$ for $k \neq i$ may never happen, and $\mathbf{1}_{X_i}(X_k)\, \mathbf{1}_{Z_i}(Z_k)$ may never be different from zero, even in large samples. In this case $\hat L_{n,h}(X_i, Y_i, Z_i)$ is zero. This could have adverse effects on the size accuracy and the power property of the test in finite samples. This is supported by our Monte Carlo study.

A desirable property of the DG test is that it is invariant to strictly monotonic transformations of $\{X_i\}$ and $\{Y_i\}$. To see this, let $m_X(X) = (m_{X_1}(X_1), \ldots, m_{X_{d_X}}(X_{d_X}))'$ and $m_Y(Y) = (m_{Y_1}(Y_1), \ldots, m_{Y_{d_Y}}(Y_{d_Y}))'$, where all the univariate functions $m_{X_k}(\cdot)$ and $m_{Y_k}(\cdot)$ are strictly monotonic. Then

$$\big[\mathbf{1}_{Y_i}(Y_k) - \mathbf{1}_{Y_i}(Y_j)\big]\, \mathbf{1}_{X_i}(X_k) = \big[\mathbf{1}_{m_Y(Y_i)}(m_Y(Y_k)) - \mathbf{1}_{m_Y(Y_i)}(m_Y(Y_j))\big]\, \mathbf{1}_{m_X(X_i)}(m_X(X_k)).$$

As a result, $\hat L_{n,h}(X_i, Y_i, Z_i)$ is invariant to strictly monotonic transformations of $\{X_i\}$ and $\{Y_i\}$. This is an appealing property that is not shared by the GCR test.

More broadly, one may argue that a conditional independence test should be invariant to measurable and invertible transformations of all the variables $\{X_i\}$, $\{Y_i\}$, and $\{Z_i\}$. The DG test is not invariant under this stronger notion of invariance. However, tests based on nonparametric estimators of some divergence measures between probability distributions, such as Shannon's entropy metric or Hellinger's distance measure, may be invariant in the stronger sense. An example of such a test is SW (2008), which is based on a weighted Hellinger distance between $f_{XYZ}(x, y, z) f_Z(z)$ and $f_{XZ}(x, z) f_{YZ}(y, z)$. Like the GCR test, the omnibus tests of LG (1997) and SW (2007) are not invariant to measurable and invertible transformations or strictly monotonic transformations.

While transformation invariance is a pleasant property, the invariance requirement is often invoked to reduce the class of tests under consideration so that uniformly most powerful invariant (UMPI) tests can be designed. However, in the present context, without specifying the direction of local alternatives, there is no uniformly most powerful test, even among the class of invariant tests. We avoid choosing a direction in order to hedge against the possibility of having no power in other directions. The lack of a UMPI test makes the invariance requirement not as compelling as in some other settings where a UMPI test exists. We view the GCR test and other tests, invariant or not, as complementary.

For example, while the SW (2008) test, which is invariant, can be powerful in detecting high-frequency alternatives, it suffers from the curse of dimensionality, as discussed above. In contrast, the GCR test, which is not invariant, does not suffer from the curse of dimensionality and is more powerful against low-frequency departures. In addition, the GCR test may be made invariant to strictly monotonic transformations if we convert the data $\{X_i\}$ and $\{Y_i\}$ to ranks before applying the GCR test. A systematic study of this rank-based GCR test is beyond the scope of this paper.

3 Stochastic Approximations and Finite Dimensional Convergence

3.1 Assumptions

In this subsection, we state the assumptions that are required to establish the asymptotic properties of $\hat\Lambda_{n,h}(\theta)$. We start with a definition, which uses the following multi-index notation: for $j = (j_1, \ldots, j_m)$ with $j_\ell$ being nonnegative integers, we denote $|j| = j_1 + j_2 + \cdots + j_m$, $j! = j_1! \cdots j_m!$, $u^j = u_1^{j_1} \cdots u_m^{j_m}$, and $D^j g(u) = \partial^{|j|} g(u)/\partial u_1^{j_1} \cdots \partial u_m^{j_m}$.

Definition 2 $G^{\alpha}(A, \Delta, m)$, $\alpha > 0$, is a class of functions $g_\lambda(\cdot): \mathbb{R}^m \to \mathbb{R}$ indexed by $\lambda \in A$ satisfying the following two conditions: (a) for each $\lambda$, $g_\lambda(\cdot)$ is $b$ times continuously differentiable, where $b$ is the greatest integer that is smaller than $\alpha$; (b) let

$$Q_\lambda(u, v) = \sum_{j: |j|\leq b,\, j\neq 0} \frac{D^j g_\lambda(v)}{j!}\,(u - v)^j$$

be the Taylor series expansion of $g_\lambda(u)$ around $v$ of order $b$ (excluding the constant term); then

$$\sup_{\lambda \in A}\ \big\|g_\lambda(u) - g_\lambda(v) - Q_\lambda(u, v)\big\| \leq \Delta\, \|u - v\|^{\alpha} \ \text{ for some constant } \Delta > 0.$$

In the absence of the index set $A$, we use $G^{\alpha}(\Delta, m)$ to denote the class of functions. In this case, our definition is similar to the corresponding definitions in Robinson (1988) and DG (2001). A sufficient condition for condition (b) is that the partial derivatives of the $b$-th order are uniformly Hölder continuous:

$$\sup_{\lambda \in A}\ \big\|D^j g_\lambda(u) - D^j g_\lambda(v)\big\| \leq \Delta\, \|v - u\|^{\alpha - b} \ \text{ for all } j \text{ such that } |j| = b.$$

We are ready to present our assumptions.

Assumption 1 (IID) (a) $\{W_i \in [0, 1]^d\}_{i=1}^{n}$ is an IID sequence of random variables on the complete probability space $(\Omega, \mathcal{F}, P)$; (b) each element $Z_\ell$ of $Z$ is supported on $[0, 1]$; (c) the distribution of $Z$ admits a density function $f_Z(z)$ with respect to the Lebesgue measure.

Assumption 2 (Smoothness of the Densities) (a) $f_Z(\cdot) \in G^{q+1}(\Delta, d_Z)$ for some integer $q > 1$ and some constant $\Delta > 0$; (b) $D^j f_Z(z) = 0$ for all $|j| \leq q$ and all $z$ on the boundary of $[0, 1]^{d_Z}$; (c) the conditional distribution functions $F_{Y|Z}$, $F_{X|Z}$, and $F_{XY|Z}$ admit the respective densities $f_{Y|Z}(y \mid z)$, $f_{X|Z}(x \mid z)$, and $f_{XY|Z}(x, y \mid z)$ with respect to a finite counting measure, the Lebesgue measure, or their product measure;

(d) as functions of $z$ indexed by $x$, $y$, or $(x, y) \in A$, the densities $f_{X|Z}(x \mid z)$, $f_{Y|Z}(y \mid z)$, and $f_{XY|Z}(x, y \mid z)$ belong to $G^{q+1}(A, \Delta, d_Z)$ with $A = [0,1]^{d_X}$, $[0,1]^{d_Y}$, or $[0,1]^{d_X + d_Y}$, respectively.

Assumption 3 (GCR) (a) $\Theta$ is compact with non-empty interior; (b) $\varphi \in G^{2}(\Delta, 1)$.

Assumption 4 (Kernel Function) The univariate kernel $k(\cdot)$ is a $q$th order symmetric and bounded kernel $k: \mathbb{R} \to \mathbb{R}$ such that (a) $\int k(v)\, dv = 1$ and $\int v^j k(v)\, dv = 0$ for $j = 1, 2, \ldots, q - 1$; (b) $k(v) = O\big((1 + |v|)^{-\Delta}\big)$ for some $\Delta > q^2 + q + 1$.

Assumption 5 (Bandwidth) The bandwidth $h = h_n$ satisfies (a) $n h^{d_Z} \to \infty$ as $n \to \infty$; (b) $\sqrt{n}\, h^{q} = o(1)$, i.e., $h = o(n^{-1/(2q)})$, as $n \to \infty$.

Some discussion of the assumptions is in order. The IID condition in Assumption 1 is maintained for convenience. Analogous results hold under weaker conditions, but we leave explicit consideration of these aside. Assumptions 2(a) and 2(d) are needed to control the smoothing bias. Under Assumptions 1(b) and 2(a), we have $\int f_Z^2(z)\, dz < \infty$, so it is not necessary to state the square integrability of $f_Z(z)$ as a separate assumption. In Assumption 2(d), the smoothness condition is with respect to the conditioning variable. It does not require the marginal distributions of $X$ and $Y$ to be smooth. In fact, $X$ and $Y$ could be either discrete or continuous. In addition, from a technical point of view, we only need to assume that there exists a version of the conditional density functions satisfying Assumption 2(d).

Assumption 2(b) is a technical condition, which helps avoid the boundary bias problem, a well-known problem for density estimation at the boundary. The GCR approach of StW requires the boundedness of the random vectors, so we have to deal with the boundary bias problem. If Assumption 2(b) does not hold, we can transform $Z$ into $Z^* = (\ell(Z_1), \ell(Z_2), \ldots, \ell(Z_{d_Z}))'$, where $\ell: [0,1] \to [0,1]$ is strictly increasing and $q + 1$ times continuously differentiable with inverse $\ell^{-1}$. Now

$$P\{Z^* < z\} = P\{Z_1 < \ell^{-1}(z_1), \ldots, Z_{d_Z} < \ell^{-1}(z_{d_Z})\} = F_Z\big(\ell^{-1}(z_1), \ldots, \ell^{-1}(z_{d_Z})\big),$$

and the density of $Z^*$ is

$$f_{Z^*}(z) = f_Z\big(\ell^{-1}(z)\big)\, \frac{d\ell^{-1}(z_1)}{dz_1} \cdots \frac{d\ell^{-1}(z_{d_Z})}{dz_{d_Z}}.$$

So if $\ell^{(i)}(0) = \ell^{(i)}(1) = 0$ for $i = 1, \ldots, q$, then Assumption 2(b) is satisfied for the transformed random vector $Z^*$, and we can work with $Z^*$ rather than $Z$. We can do so because $Y \perp X \mid Z$ if and only if $Y \perp X \mid Z^*$. An example of $\ell$ is the CDF of a beta distribution:

$$\ell(v) = \frac{1}{B(q+1, q+1)}\int_0^{v} x^{q}(1 - x)^{q}\, dx := \frac{B(v; q+1, q+1)}{B(1; q+1, q+1)},$$

where $B(v; q+1, q+1) = \int_0^{v} x^{q}(1 - x)^{q}\, dx$ is the incomplete beta function.
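As a concrete illustration, this boundary transformation can be applied componentwise with a standard library routine. The sketch below assumes $Z$ has already been mapped into $[0,1]^{d_Z}$; the function name is illustrative.

```python
from scipy.stats import beta

def boundary_transform(Z01, q=6):
    """Apply the Beta(q+1, q+1) CDF componentwise to Z in [0,1]^{d_Z}.

    The first q derivatives of this CDF vanish at 0 and 1, which is the
    condition needed for Assumption 2(b) to hold for the transformed Z*.
    """
    return beta.cdf(Z01, q + 1, q + 1)
```

Since the map is strictly increasing in each component, conditioning on the transformed $Z^*$ is equivalent to conditioning on $Z$, so the test can be applied to $(X, Y, Z^*)$ without changing the hypothesis being tested.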

The idea of using transformations to remove the boundary bias has a long history; see Geenens (2014) and the references therein. However, our idea is different here. In Geenens (2014) and the related literature, in order to reduce the boundary bias, a transformation is employed to map a bounded support into an unbounded support. This approach clearly is not compatible with the GCR framework we use here. Our idea is to transform the data so that the probability mass around the boundary is relatively small. This is a viable approach as long as our focus is not the pdf of the original data per se. This idea may be of independent interest.

Another often used approach to handle the boundary problem is to trim the data when $Z$ is boundedly supported. Let $\mathcal{Z}_\varepsilon$ be a proper subset of the support of $Z$ satisfying $P[Z_i \notin \mathcal{Z}_\varepsilon] = \varepsilon$ for some $\varepsilon$ approaching zero at a certain rate. In this approach, we replace $K_h(Z_i - Z_j)$ in the definition of $\hat\Lambda_{n,h}(\theta)$ by $K_h(Z_i - Z_j)\, \mathbf{1}[Z_i \in \mathcal{Z}_\varepsilon]\, \mathbf{1}[Z_j \in \mathcal{Z}_\varepsilon]$. For example, this approach is used in Li and Fan (2003, page 745) in a different context. Note that the use of transformation or trimming is more of theoretical importance, as they are necessary to achieve the parametric $\sqrt{n}$ rate of convergence of $\hat\Lambda_{n,h}(\theta)$ to $\Lambda(\theta)$. In practice, transformation and trimming may or may not be important. When the probability mass near the boundary is relatively small, they are likely to be unimportant. In this case, we may skip the transformation or trimming and ignore the boundary bias in practice.

Assumption 3(a) is needed only when we attempt to establish the uniformity of some asymptotic properties over $\Theta$. Like Assumption 2, Assumption 3(b) helps control the smoothing bias. It is satisfied by many GCR functions such as $\exp(\cdot)$, the normal PDF, $\sin(\cdot)$, and $\cos(\cdot)$. The conditions on the high order kernel in Assumption 4 are fairly standard. For example, both Robinson (1988) and DG (2001) make a similar assumption. The only difference is that Robinson (1988) and DG (2001) require that $\Delta > q + 1$, while we require the stronger condition that $\Delta > q^2 + q + 1$ in Assumption 4(b). The stronger condition is needed to control the boundary bias, which is absent in Robinson (1988) and DG (2001), as they assume that $Z$ has an unbounded support. Assumption 4(b) is not restrictive. It is satisfied by typical kernels used in practice, as they are either supported on $[-1, 1]$ or have exponentially decaying tails. Assumption 5(a) ensures that the degenerate U-statistic in the Hoeffding decomposition of $\hat\Lambda_{n,h}(\theta)$ is asymptotically negligible. Assumption 5(b) removes the dominating bias of $\hat\Lambda_{n,h}(\theta)$. See Lemmas 3 and 4 below. A necessary condition for Assumption 5 to hold is that $2q > d_Z$.

3.2 Stochastic Approximations

To establish the asymptotic properties of $\hat\Lambda_{n,h}(\theta)$, we develop some stochastic approximations, using the theory of U-statistics and U-processes pioneered by Hoeffding (1948). Let $\bar\psi_{h,\theta}(w; \theta) = E\,\psi_{h,\theta}(w, W_j; \theta)$. Using Hoeffding's H-decomposition, we can decompose $\hat\Lambda_{n,h}(\theta)$ as

$$\hat\Lambda_{n,h}(\theta) = \Lambda_h(\theta) + H_{n,h}(\theta) + R_{n,h}(\theta),$$

where

$$\Lambda_h(\theta) = E\,\psi_{h,\theta}(W_j, W_i; \theta) = E\,\bar\psi_{h,\theta}(W_i; \theta), \qquad (18)$$
$$H_{n,h}(\theta) = \frac{2}{n}\sum_{i=1}^{n} \tilde{\bar\psi}_{h,\theta}(W_i; \theta), \qquad (19)$$
$$R_{n,h}(\theta) = \binom{n}{2}^{-1}\sum_{i<j} \tilde\psi_{h,\theta}(W_i, W_j; \theta), \qquad (20)$$

and

$$\tilde{\bar\psi}_{h,\theta}(W_i; \theta) = \bar\psi_{h,\theta}(W_i; \theta) - \Lambda_h(\theta), \qquad \tilde\psi_{h,\theta}(W_i, W_j; \theta) = \psi_{h,\theta}(W_i, W_j; \theta) - \bar\psi_{h,\theta}(W_i; \theta) - \bar\psi_{h,\theta}(W_j; \theta) + \Lambda_h(\theta).$$

The sum of the first two terms in the H-decomposition is known as the Hájek projection. For easy reference, we denote it as

$$\tilde\Lambda_{n,h}(\theta) = \Lambda_h(\theta) + H_{n,h}(\theta). \qquad (21)$$

By construction, $H_{n,h}(\theta)$ and $R_{n,h}(\theta)$ are uncorrelated zero mean random variables. We show that the projection remainder $R_{n,h}(\theta)$ is asymptotically negligible, and as a result $\hat\Lambda_{n,h}(\theta)$ and its Hájek projection $\tilde\Lambda_{n,h}(\theta)$ have the same limiting distribution.

For each given $\theta$ and $h$, $R_{n,h}(\theta)$ is a degenerate second order U-statistic with kernel $\tilde\psi_{h,\theta}(\cdot, \cdot; \theta)$. According to the theory of U-statistics (e.g., Lee, 1990), we have

$$\mathrm{var}[R_{n,h}(\theta)] = \frac{2}{n(n-1)}\, \mathrm{var}[\tilde\psi_{h,\theta}(W_i, W_j; \theta)].$$

This can also be proved directly by observing that $\tilde\psi_{h,\theta}(W_i, W_j; \theta)$ is uncorrelated with $\tilde\psi_{h,\theta}(W_\ell, W_m; \theta)$ if $(i, j) \neq (\ell, m)$. If $h$ were fixed, then it would follow from basic U-statistic theory that $R_{n,h}(\theta) = o_p(1/\sqrt{n})$ for each $\theta$. However, in the present setting, $h \to 0$ as $n \to \infty$, so the basic U-statistic theory does not directly apply. Nevertheless, we can show that $R_{n,h}(\theta)$ is still $o_p(n^{-1/2})$ under Assumption 5(a). In fact, we can prove a stronger result, as Lemma 3 shows.

Lemma 3 Under Assumptions 1, 2, and 5(a), if $h \to 0$ as $n \to \infty$, then $\sup_{\theta \in \Theta}\big|\sqrt{n}\, R_{n,h}(\theta)\big| = o_p(1)$.

We proceed to establish a stochastic approximation of the Hájek projection $\tilde\Lambda_{n,h}(\theta)$. Note that both $\Lambda_h(\theta)$ and $H_{n,h}(\theta)$ depend on $h$. Using a Taylor expansion, we can separate terms independent of $h$ from those associated with $h$ in $\Lambda_h(\theta)$ and $H_{n,h}(\theta)$. By using a higher order kernel $K$ and controlling the rate of $h$ so that it shrinks fast enough, we can ensure that the terms associated with $h$ vanish asymptotically, as in Powell, Stock, and Stoker (1989). More specifically, we first show that $\Lambda_h(\theta) = \Lambda(\theta) + O(h^q)$, where $q$ is the order of the kernel $k$. Then we show that $H_{n,h}(\theta) = \frac{2}{n}\sum_{i=1}^{n}\{\psi(W_i; \theta) - E[\psi(W_i; \theta)]\} + O_p(h^q)$, where

$$\psi(W_i; \theta) = \frac{1}{2}\Big[\varphi(\theta_0 + X_i'\theta_1 + Y_i'\theta_2 + Z_i'\theta_3) f_Z(Z_i) - \int \varphi(\theta_0 + X_i'\theta_1 + y'\theta_2 + Z_i'\theta_3)\, f_{YZ}(y, Z_i)\, dy + \int\!\!\int \varphi(\theta_0 + x'\theta_1 + y'\theta_2 + Z_i'\theta_3)\, f_{XYZ}(x, y, Z_i)\, dx\, dy - \int \varphi(\theta_0 + x'\theta_1 + Y_i'\theta_2 + Z_i'\theta_3)\, f_{XZ}(x, Z_i)\, dx\Big].$$

Under Assumption 5(b), $\sqrt{n}\, h^q \to 0$, which makes both the second term of $\Lambda_h(\theta)$ and the second term of $H_{n,h}(\theta)$ vanish asymptotically. The following lemma presents these results formally.

Lemma 4 Let Assumptions 1-4 and 5(b) hold. Then

(a) $\sqrt{n}\,[\Lambda_h(\theta) - \Lambda(\theta)] = o(1)$ uniformly over $\theta \in \Theta$;

(b) $\sqrt{n}\, H_{n,h}(\theta) = \frac{2}{\sqrt{n}}\sum_{i=1}^{n}\{\psi(W_i; \theta) - E[\psi(W_i; \theta)]\} + o_p(1)$ uniformly over $\theta \in \Theta$.

It follows from Lemmas 3 and 4 that

$$\sqrt{n}\big[\hat\Lambda_{n,h}(\theta) - \Lambda(\theta)\big] = \sqrt{n}\, H_{n,h}(\theta) + \sqrt{n}\, R_{n,h}(\theta) + \sqrt{n}\,[\Lambda_h(\theta) - \Lambda(\theta)] = \sqrt{n}\, H_{n,h}(\theta) + o_p(1) = \frac{2}{\sqrt{n}}\sum_{i=1}^{n}\{\psi(W_i; \theta) - E[\psi(W_i; \theta)]\} + o_p(1)$$

uniformly over $\theta \in \Theta$. So $\sqrt{n}[\hat\Lambda_{n,h}(\theta) - \Lambda(\theta)]$ and $\frac{2}{\sqrt{n}}\sum_{i=1}^{n}\{\psi(W_i; \theta) - E[\psi(W_i; \theta)]\}$ have the same limiting distribution for each $\theta$.

3.3 Finite Dimensional Convergence

In this subsection, we view $\hat\Lambda_{n,h}(\theta)$ as a U-process indexed by $\theta$ and consider its finite-dimensional convergence. Let $\vartheta_s = \{\theta_1, \theta_2, \ldots, \theta_s\}$ for some $s < \infty$ and $\theta_\ell \in \Theta$, and define $\hat\Lambda_{n,h}(\vartheta_s) := [\hat\Lambda_{n,h}(\theta_1), \hat\Lambda_{n,h}(\theta_2), \ldots, \hat\Lambda_{n,h}(\theta_s)]'$. Similarly, we define $\Lambda(\vartheta_s) := [\Lambda(\theta_1), \Lambda(\theta_2), \ldots, \Lambda(\theta_s)]'$. Theorem 5 below establishes the asymptotic normality of $\sqrt{n}[\hat\Lambda_{n,h}(\vartheta_s) - \Lambda(\vartheta_s)]$.

Theorem 5 Let Assumptions 1-5 hold. Then

$$\sqrt{n}\big[\hat\Lambda_{n,h}(\vartheta_s) - \Lambda(\vartheta_s)\big] \stackrel{d}{\to} N(0, \Sigma),$$

where the $(\ell, m)$ element of $\Sigma$ is

$$\Sigma(\ell, m) := \Sigma(\theta_\ell, \theta_m) = 4\,\mathrm{cov}\big[\psi(W_i; \theta_\ell),\ \psi(W_i; \theta_m)\big]. \qquad (22)$$

If, in addition, $H_0$ holds, then $\Lambda(\vartheta_s) = 0$ and $\Sigma(\theta_\ell, \theta_m) = 4\, E\big[\psi(W_i; \theta_\ell)\, \psi(W_i; \theta_m)\big]$, where

$$\psi(W_i; \theta) = \frac{1}{2}\Big\{E\big[\varphi(\tilde W_i'\theta) f_Z(Z_i) \mid X_i, Y_i, Z_i\big] - E\big[\varphi(\tilde W_i'\theta) f_Z(Z_i) \mid X_i, Z_i\big] - E\big[\varphi(\tilde W_i'\theta) f_Z(Z_i) \mid Y_i, Z_i\big] + E\big[\varphi(\tilde W_i'\theta) f_Z(Z_i) \mid Z_i\big]\Big\}. \qquad (23)$$

Theorem 5 is of interest in its own right. For example, we can use it to construct a Wald test. There may be some power loss if $s$ is small. When $s$ is large enough such that $\vartheta_s$ approximates $\Theta$ very well, then the power loss will be small. The idea can be motivated from the method of sieves. We do not pursue this here but refer to Huang (2009) for more discussion. Instead, we consider the ICM tests in the next section. Theorem 5 is an important first step in obtaining the asymptotic distributions of the ICM statistics.

Observe that $\hat\Lambda_{n,h}(\theta)$ is not symmetric in $X$ and $Y$, whereas the hypothesis $Y \perp X \mid Z$ is. However, $\sqrt{n}[\hat\Lambda_{n,h}(\theta) - \Lambda(\theta)]$ is asymptotically equivalent to $\frac{2}{\sqrt{n}}\sum_{i=1}^{n}[\psi(W_i; \theta) - E\,\psi(W_i; \theta)]$. It can be readily checked that $\psi(W; \theta)$ is symmetric in $Y$ and $X$. Alternatively, we can follow the definition of $g_X$ in (14) and define $g_Y(y, z; \theta)$, $g_Z(z; \theta)$, and $g_{XY}(x, y, z; \theta)$ as

$$g_Y(y, z; \theta) = E[\varphi(X, y, z; \theta) \mid Z = z], \quad g_Z(z; \theta) = E[\varphi(X, Y, z; \theta) \mid Z = z], \quad g_{XY}(x, y, z; \theta) = E[\varphi(x, y, z; \theta) \mid Z = z] = \varphi(x, y, z; \theta),$$

where the last equality is tautological. Then

$$\psi(W; \theta) = \frac{1}{2}\big[g_{XY}(X, Y, Z; \theta) - g_X(X, Z; \theta) - g_Y(Y, Z; \theta) + g_Z(Z; \theta)\big]\, f_Z(Z),$$

which is clearly symmetric in $Y$ and $X$. If we construct another estimator, say $\hat\Lambda^{\dagger}_{n,h}(\theta)$, by switching the roles of $X$ and $Y$, we can show that $\hat\Lambda_{n,h}(\theta)$ and $\hat\Lambda^{\dagger}_{n,h}(\theta)$ are asymptotically equivalent in the sense that $\sqrt{n}[\hat\Lambda_{n,h}(\theta) - \hat\Lambda^{\dagger}_{n,h}(\theta)] = o_p(1)$ uniformly over $\theta \in \Theta$. So there is no asymptotic gain in taking an average of $\hat\Lambda_{n,h}(\theta)$ and $\hat\Lambda^{\dagger}_{n,h}(\theta)$. This point is further supported by the symmetry of $\psi(W; \theta)$ in $X$ and $Y$.

3.4 Bandwidth Selection

Although any choice of bandwidth satisfying Assumption 5 will deliver the asymptotic distribution in Theorem 5, in practice we need some guidance on how to select $h$. Ideally we should select an $h$ that would give us the greatest power for a given size of test, but deriving that procedure would be complicated enough to justify another study. Moreover, it would only make a difference for higher order results. Thus, for the present purposes, we just provide a simple plug-in estimator of the MSE-minimizing bandwidth proposed by Powell and Stoker (1996). Since the test statistic is based on $\hat\Lambda_{n,h}(\theta)$, which estimates $\Lambda(\theta)$, it is appealing to choose an $h$ that minimizes the mean squared error (MSE) of $\hat\Lambda_{n,h}(\theta)$. After some tedious but straightforward calculations, we get

$$\mathrm{MSE}\big[\hat\Lambda_{n,h}(\theta)\big] = \big(\Lambda_h(\theta) - \Lambda(\theta)\big)^2 + \mathrm{var}\big[\hat\Lambda_{n,h}(\theta)\big] = \{E[B_5(W, \theta)]\}^2 h^{2q} + o(h^{2q}) + \mathrm{var}\big[\hat\Lambda_{n,h}(\theta)\big]$$
$$= \{E[B_5(W, \theta)]\}^2 h^{2q} + o(h^{2q}) + 4n^{-1}\mathrm{var}[\psi(W; \theta)] + 4n^{-1} C_1(\theta) h^{q} + o(n^{-1} h^{q}) - 4n^{-2}\mathrm{var}[\psi(W; \theta)] + n^{-2} E[\zeta(W, \theta)]\, h^{-d_Z} + o(n^{-2} h^{-d_Z}) + o(n^{-1}),$$

where $B_5$ is defined in (47) in the appendix, and $\zeta(W, \theta)$ is defined by

$$E\big[\|\psi_{h,\theta}(W_i, W_j; \theta)\|^2 \mid W_i\big] = \zeta(W_i, \theta)\, h^{-d_Z} + \xi(W_i, \theta, h), \quad\text{where}\quad E\big[\|\xi(W_i, \theta, h)\|\big] = o\big(h^{-d_Z}\big).$$

The term $4n^{-1}\mathrm{var}[\psi(W; \theta)] - 4n^{-2}\mathrm{var}[\psi(W; \theta)]$ does not depend on $h$. The remaining $o(n^{-1})$ term must be of smaller order than $4n^{-1} C_1(\theta) h^{q}$, and $4n^{-1} C_1(\theta) h^{q}$ must be of smaller order than $\{E[B_5(W, \theta)]\}^2 h^{2q}$; otherwise there would be a contradiction to Assumption 5(b). So the leading term of $\mathrm{MSE}[\hat\Lambda_{n,h}(\theta)]$ that involves $h$ is

$$\mathrm{MSE}_h\big[\hat\Lambda_{n,h}(\theta)\big] = \{E[B_5(W, \theta)]\}^2 h^{2q} + n^{-2} E[\zeta(W, \theta)]\, h^{-d_Z}. \qquad (24)$$

By minimizing $\mathrm{MSE}_h[\hat\Lambda_{n,h}(\theta)]$, we obtain the optimal bandwidth

$$h^* = \left[\frac{d_Z\, E[\zeta(W, \theta)]}{2q\, \{E[B_5(W, \theta)]\}^2}\right]^{1/(2q + d_Z)} n^{-2/(2q + d_Z)}. \qquad (25)$$

Now Assumption 5(a) is satisfied:

$$n\,(h^*)^{d_Z} \propto n\, n^{-2 d_Z/(2q + d_Z)} = n^{(2q - d_Z)/(2q + d_Z)} \to \infty, \quad\text{given } 2q > d_Z.$$

And so is Assumption 5(b):

$$\sqrt{n}\,(h^*)^{q} \propto n^{1/2}\, n^{-2q/(2q + d_Z)} = n^{-(2q - d_Z)/(2(2q + d_Z))} = o(1), \quad\text{given } 2q > d_Z.$$

The optimal bandwidth depends on the unknown quantities $E[\zeta(W, \theta)]$ and $E[B_5(W, \theta)]$. Here we follow the standard practice (e.g., Powell and Stoker (1996)) and use a simple plug-in estimator of $h^*$. Let $h_0$ be an initial bandwidth. Suppose $E\|\psi_{h_0,\theta}(W_i, W_j; \theta)\|^4 = O(h_0^{-\kappa d_Z})$ for some $\kappa > 0$, and let $\varrho = \max\{\kappa + d_Z, 2q + d_Z\}$. If $h_0 \to 0$ and $n^2 h_0^{\varrho} \to \infty$, then by Proposition 4.1 of Powell and Stoker (1996),

$$\hat\zeta \equiv \hat\zeta(h_0) = \binom{n}{2}^{-1} h_0^{d_Z} \sum_{i<j}\big[\psi_{h_0,\theta}(W_i, W_j; \theta)\big]^2 \stackrel{p}{\to} E[\zeta(W_i, \theta)], \qquad (26)$$

and

$$\hat B_5 \equiv \frac{\hat\Lambda_{n,h_0}(\theta) - \hat\Lambda_{n,\gamma h_0}(\theta)}{h_0^{q} - (\gamma h_0)^{q}} \stackrel{p}{\to} E[B_5(W, \theta)] \quad\text{for some } 0 < \gamma \neq 1.$$

The estimator $\hat B_5$ given above is a slope between the two points $(h_0^{q}, \hat\Lambda_{n,h_0}(\theta))$ and $((\gamma h_0)^{q}, \hat\Lambda_{n,\gamma h_0}(\theta))$. To get a more stable estimator, we could use a regression of $\hat\Lambda_{n,h}(\theta)$ on $h^{q}$ for various values of $h$. Given $\hat\zeta$ and $\hat B_5$, the plug-in estimator of $h^*$ is

$$\hat h = \left[\frac{d_Z\, \hat\zeta}{2q\, \hat B_5^2}\right]^{1/(2q + d_Z)} n^{-2/(2q + d_Z)}. \qquad (27)$$

In practice we can choose $q$ large enough so that $\varrho = \max\{\kappa + d_Z, 2q + d_Z\} = 2q + d_Z$; then we can choose the initial bandwidth to be $h_0 = o(n^{-1/(2q + d_Z)})$. The data-driven $\hat h$ depends on $\theta$. We may choose different bandwidths for different $\theta$'s. This is what we follow in our Monte Carlo experiments. Powell and Stoker (1996) mention one technical proviso: $\hat\Lambda_{n,\hat h}(\theta)$ is not guaranteed to be asymptotically equivalent to $\hat\Lambda_{n,h^*}(\theta)$, since the MSE calculations are based on the assumption that $h$ is deterministic. The suggested solution is to discretize the set of possible scaling constants, replacing $\hat h$ with the closest value, $\hat h^{\dagger}$, in some finite set. The estimation uncertainty in $\hat h^{\dagger}$ is small enough that it will not affect the asymptotic MSE.
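Below is a minimal sketch of the plug-in bandwidth (27), under the same simplifying assumptions as the earlier sketch (standard normal $\varphi$, second-order Gaussian product kernel, bounded data); the helper name `_pairwise_terms`, the default $\gamma = 0.5$, and the initial bandwidth formula are illustrative assumptions, not the paper's code.

```python
import numpy as np
from scipy.stats import norm

def _pairwise_terms(X, Y, Z, theta0, t_x, t_y, t_z, h):
    """Matrix of a_ij = [phi(W~_i'theta) - phi(W~_{i,j}'theta)] * K_h(Z_i - Z_j)."""
    n, d_z = Z.shape
    base = theta0 + X @ t_x + Z @ t_z
    s_own = base + Y @ t_y
    s_cross = base[:, None] + (Y @ t_y)[None, :]
    K = norm.pdf((Z[:, None, :] - Z[None, :, :]) / h).prod(axis=2) / h ** d_z
    np.fill_diagonal(K, 0.0)
    return (norm.pdf(s_own)[:, None] - norm.pdf(s_cross)) * K

def plugin_bandwidth(X, Y, Z, theta0, t_x, t_y, t_z, q=6, gamma=0.5):
    """Plug-in estimate of the MSE-optimal bandwidth in equation (27)."""
    n, d_z = Z.shape
    h0 = n ** (-1.0 / (3 * (q + d_z)))                      # illustrative initial bandwidth
    A0 = _pairwise_terms(X, Y, Z, theta0, t_x, t_y, t_z, h0)
    A1 = _pairwise_terms(X, Y, Z, theta0, t_x, t_y, t_z, gamma * h0)
    psi = 0.5 * (A0 + A0.T)                                 # symmetrized kernel psi_{h0,theta}
    iu = np.triu_indices(n, k=1)
    zeta_hat = h0 ** d_z * (psi[iu] ** 2).mean()            # estimator (26)
    lam0 = A0.sum() / (n * (n - 1))                         # Lambda_hat at h0
    lam1 = A1.sum() / (n * (n - 1))                         # Lambda_hat at gamma * h0
    B5_hat = (lam0 - lam1) / (h0 ** q - (gamma * h0) ** q)  # slope estimator of E[B_5]
    return (d_z * zeta_hat / (2 * q * B5_hat ** 2)) ** (1.0 / (2 * q + d_z)) * n ** (-2.0 / (2 * q + d_z))
```

A regression of $\hat\Lambda_{n,h}(\theta)$ on $h^q$ over several bandwidths, as suggested above, would replace the two-point slope `B5_hat` with a least-squares slope and tends to be more stable in small samples.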

4 An Integrated Conditional Moment Test

In this section, we integrate out $\theta$ to get an integrated conditional moment (ICM) type test statistic, following Bierens (1990) and StW (1998).

4.1 The Test Statistic and its Asymptotic Null Distribution

If $\varphi$ is GCR, testing $H_0: Y \perp X \mid Z$ is equivalent to testing $H_0': \Lambda(\theta) = 0$ for essentially all $\theta \in \Theta$. In other words, if we view $\hat\Lambda_{n,h}(\theta)$ as a random function in $\theta$, we are testing whether its mean function $\Lambda(\theta)$ is zero on $\Theta$. If $\Theta$ is compact, we can show that $\sqrt{n}\,\hat\Lambda_{n,h}(\theta)$ converges to a zero mean Gaussian process under the null. Based on $\sqrt{n}\,\hat\Lambda_{n,h}(\theta)$, we construct the ICM test statistic

$$M_n = \int_{\Theta}\big[\sqrt{n}\,\hat\Lambda_{n,h}(\theta)\big]^2\, d\mu(\theta),$$

where $\mu$ is a probability measure on $\Theta$ that is absolutely continuous with respect to the Lebesgue measure on $\Theta$. Here we integrate $[\sqrt{n}\,\hat\Lambda_{n,h}(\theta)]^2$, which gives a Cramér-von Mises (CM) type test. Alternatively, we could integrate $|\sqrt{n}\,\hat\Lambda_{n,h}(\theta)|^p$, $p \geq 1$. The choice $p = \infty$ (which gives the maximum over $\Theta$) yields a Kolmogorov-Smirnov (KS) type test. We work with $p = 2$ for concreteness and because CM-type tests often outperform KS-type tests.

To establish the weak convergence of $M_n$, we first show that $\sqrt{n}[\hat\Lambda_{n,h}(\theta) - \Lambda(\theta)]$ converges to a Gaussian process. Define

$$\nu_n(\theta) = \frac{2}{\sqrt{n}}\sum_{i=1}^{n}\{\psi(W_i; \theta) - E[\psi(W_i; \theta)]\}.$$

Then Lemmas 3 and 4 imply that $\sup_{\theta \in \Theta}\big|\sqrt{n}[\hat\Lambda_{n,h}(\theta) - \Lambda(\theta)] - \nu_n(\theta)\big| = o_p(1)$. Theorem 6 below shows that $\nu_n(\theta)$ converges to a zero mean Gaussian process and so does $\sqrt{n}[\hat\Lambda_{n,h}(\theta) - \Lambda(\theta)]$.

Theorem 6 Let Assumptions 1-5 hold. Then (a) $\nu_n(\cdot) \stackrel{d}{\to} \nu(\cdot)$; (b) $\sqrt{n}[\hat\Lambda_{n,h}(\cdot) - \Lambda(\cdot)] \stackrel{d}{\to} \nu(\cdot)$, where $\nu$ is a zero mean Gaussian process on $\Theta$ with covariance function

$$\mathrm{cov}\big(\nu(\theta_1), \nu(\theta_2)\big) = 4\,\mathrm{cov}\big[\psi(W; \theta_1),\ \psi(W; \theta_2)\big] \equiv \Sigma(\theta_1, \theta_2). \qquad (28)$$

If $H_0$ also holds, then $T_n(\cdot) \equiv \sqrt{n}\,\hat\Lambda_{n,h}(\cdot) \stackrel{d}{\to} \nu(\cdot)$.

Let $M: C(\Theta) \to \mathbb{R}^+$ be $\|\cdot\|_\infty$ continuous. Then the continuous mapping theorem (Billingsley, 1999) implies that $M[T_n(\cdot)] \stackrel{d}{\to} M[\nu(\cdot)]$ under the null hypothesis.

For example, with $M[T_n(\cdot)] = \int_{\Theta}[T_n(\theta)]^2\, d\mu(\theta)$, we have under $H_0$:

$$M_n \equiv M[T_n(\cdot)] = \int_{\Theta}[T_n(\theta)]^2\, d\mu(\theta) = \int_{\Theta}\big[\sqrt{n}\,\hat\Lambda_{n,h}(\theta)\big]^2\, d\mu(\theta) \stackrel{d}{\to} \int_{\Theta}[\nu(\theta)]^2\, d\mu(\theta).$$

4.2 Global and Local Alternatives

The global alternatives for our conditional independence test can always be written as

$$H_a: f_Z(z) f_{XYZ}(x, y, z) - f_{YZ}(y, z) f_{XZ}(x, z) = \delta(x, y, z) \qquad (29)$$

for some nontrivial and nonzero function $\delta(x, y, z)$. Then under $H_a$, we have

$$\Lambda(\theta) = \int \varphi(\tilde w'\theta)\, \delta(x, y, z)\, dx\, dy\, dz.$$

This will be nonzero for essentially all $\theta$ provided that $\varphi$ is GCR. It follows from Theorem 6 that $\lim_{n \to \infty}\Pr(M_n > c_n) = 1$ for any critical value $c_n = o(n)$. That is, the test is consistent: as the sample size increases, the test will eventually detect the alternative $H_a$.

To construct a local alternative, we consider a mixture distribution of the form

$$H_{a,n}: f_{XYZ}(x, y, z) = \left[\left(1 - \frac{c}{\sqrt{n}}\right) f_{Y|Z}(y \mid z) + \frac{c}{\sqrt{n}}\,\tilde\delta(y \mid x, z)\right] f_{XZ}(x, z), \qquad (30)$$

where $c$ is a constant and $\tilde\delta(y \mid x, z)$ is a conditional density function of $\tilde Y$ given $(\tilde X, \tilde Z)$ such that $\tilde Y \not\perp \tilde X \mid \tilde Z$. By construction, $\tilde\delta(y \mid x, z)$ is a nontrivial function of $x$ and $z$. That is, the distribution of $W$ is a mixture of two distributions: one satisfies the null of conditional independence and the other does not. The mixing proportion is local to unity. Equivalently, we can rewrite the local alternative as

$$H_{a,n}: f_{XYZ}(x, y, z) = f_{Y|Z}(y \mid z)\, f_{XZ}(x, z) + \frac{\delta(x, y, z)}{\sqrt{n}} \qquad (31)$$

for $\delta(x, y, z) = c\,[\tilde\delta(y \mid x, z) - f_{Y|Z}(y \mid z)]\, f_{XZ}(x, z)$. Since $\tilde\delta(y \mid x, z)$ depends on $x$, $\tilde\delta(y \mid x, z) - f_{Y|Z}(y \mid z)$ cannot be a zero function. Hence when $\varphi$ is GCR and $c > 0$,

$$\Delta_\varphi(\theta) := \int \varphi(\tilde w'\theta)\, \delta(x, y, z)\, dx\, dy\, dz \neq 0 \qquad (32)$$

for essentially all $\theta \in \Theta$. Under Assumptions 1-5 and the local alternative $H_{a,n}$, we can use the same arguments as in the proof of Theorem 6 to show that

$$M_n = \int_{\Theta}[T_n(\theta)]^2\, d\mu(\theta) \stackrel{d}{\to} \int_{\Theta}\big[\nu(\theta) + \Delta_\varphi(\theta)\big]^2\, d\mu(\theta).$$

The essentially nonzero mean is the source of the power of the ICM test against the local alternative.

The above local alternative asymptotics can be used to guide the choice of the weighting function $\mu$. Using Mercer's theorem, we can represent the covariance kernel $\Sigma(\theta_1, \theta_2)$ as

$$\Sigma(\theta_1, \theta_2) = \sum_{j=1}^{\infty} \lambda_j\, e_j(\theta_1)\, e_j(\theta_2),$$

where $\lambda_j$ is an eigenvalue of the covariance kernel and $e_j(\cdot)$ is the corresponding eigenfunction such that $\int_{\Theta}\Sigma(\theta_1, \theta_2)\, e_j(\theta_1)\, d\mu(\theta_1) = \lambda_j\, e_j(\theta_2)$. $\{e_j(\cdot)\}$ also forms a set of complete orthonormal bases in $L^2(\Theta)$. Boning and Sowell (1999) show that choosing $\mu$ to be the uniform density delivers a test with the greatest weighted average local power against the set of local alternatives whose limiting local mean functions $\{\Delta_\varphi(\theta)\}$ assign weights $\lambda_j$ to $e_j(\cdot)$. This provides some theoretical justification of the uniform weighting, which is used in our simulation study.

4.3 Calculating the Asymptotic Critical Values

Under the null, $M_n$ has a limiting distribution given by a functional of a zero mean Gaussian process whose covariance function depends on the DGP. The asymptotic critical values thus depend on the DGP and cannot be tabulated. One could follow Bierens and Ploberger (1997) and obtain upper bounds for the asymptotic critical values. Here, we use the conditional Monte Carlo approach suggested by Hansen (1996) to simulate the asymptotic null distribution. To apply this approach, we construct a process $T_n^*(\theta)$ which follows the desired zero mean Gaussian process conditional on $\{W_i\}$. The desired conditional covariance function for $T_n^*$ is

$$\mathrm{cov}\big[T_n^*(\theta_1), T_n^*(\theta_2) \mid \{W_i\}_{i=1}^{n}\big] = \frac{4}{n}\sum_{i=1}^{n}\hat\psi_{h,\theta_1}(W_i; \theta_1)\, \hat\psi_{h,\theta_2}(W_i; \theta_2) \equiv \hat\Sigma(\theta_1, \theta_2),$$

where $\hat\psi_{h,\theta}(W_i; \theta) = (n-1)^{-1}\sum_{j=1, j\neq i}^{n}\psi_{h,\theta}(W_i, W_j; \theta)$. It is straightforward to show that under Assumptions 1-5 and the null hypothesis, $\hat\Sigma(\theta_1, \theta_2) \stackrel{p}{\to} \Sigma(\theta_1, \theta_2)$. A typical $T_n^*(\theta)$ is constructed by generating $\{V_i\}_{i=1}^{n}$ as IID standard normal random variables independent of $\{W_i\}$ and setting

$$T_n^*(\theta) = \frac{2}{\sqrt{n}}\sum_{i=1}^{n}\hat\psi_{h,\theta}(W_i; \theta)\, V_i. \qquad (33)$$

Following arguments similar to the proof of Theorem 2 in Hansen (1996), we can show that under the null hypothesis,

$$M_n^* = \int_{\Theta}[T_n^*(\theta)]^2\, d\mu(\theta) \stackrel{d}{\to} \int_{\Theta}[\nu(\theta)]^2\, d\mu(\theta),$$

provided that Assumptions 1-5 hold. Simulation results show that the empirical pdf's of $M_n$ and $M_n^*$ are fairly close. To save space, we do not report the results here, but they are available in Huang (2009).

To approximate the distribution of $M_n$, we follow the steps below: (i) generate $\{V_{ib}\}_{i=1}^{n}$ IID $N(0, 1)$ random variables; (ii) set

$$T_{n,b}^*(\theta) = \frac{2}{\sqrt{n}}\sum_{i=1}^{n}\hat\psi_{h,\theta}(W_i; \theta)\, V_{ib};$$

(iii) set $M_{n,b}^* = M[T_{n,b}^*(\cdot)] = \int_{\Theta}[T_{n,b}^*(\theta)]^2\, d\mu(\theta)$. Repeating these steps for $b = 1, \ldots, B$ gives a simulated sample $(M_{n,1}^*, \ldots, M_{n,B}^*)$, whose empirical distribution should be close to the true distribution of the actual test statistic $M_n$ under the null. Then we can compute the proportion of simulated values that exceed $M_n$ to get the simulated asymptotic p-value. We reject the null hypothesis if the simulated p-value lies below the specified level for the test. As Hansen (1996) points out, $B$ is under the control of the econometrician and can be chosen sufficiently large to obtain a good approximation.
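The following sketch puts steps (i)-(iii) together with the computation of $M_n$ itself. It assumes a uniform $\mu$ on a hypercube $\Theta$ approximated by $S$ random draws, IID normal multipliers, and the `_pairwise_terms` and `plugin_bandwidth` helpers from the bandwidth sketch above; the function name, the hypercube $[-1,1]^{1+d}$, and the default values of $S$ and $B$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def icm_test(X, Y, Z, S=100, B=500):
    """Compute M_n and its conditional Monte Carlo p-value (steps (i)-(iii))."""
    n, d_z = Z.shape
    d_x, d_y = X.shape[1], Y.shape[1]
    thetas = rng.uniform(-1.0, 1.0, size=(S, 1 + d_x + d_y + d_z))  # draws from mu
    T_n = np.empty(S)
    psi_hat = np.empty((S, n))
    for s, th in enumerate(thetas):
        theta0, t_x, t_y, t_z = th[0], th[1:1 + d_x], th[1 + d_x:1 + d_x + d_y], th[-d_z:]
        h = plugin_bandwidth(X, Y, Z, theta0, t_x, t_y, t_z)   # data-driven bandwidth per theta
        A = _pairwise_terms(X, Y, Z, theta0, t_x, t_y, t_z, h)
        psi = 0.5 * (A + A.T)                                  # symmetrized kernel psi_{h,theta}
        psi_hat[s] = psi.sum(axis=1) / (n - 1)                 # psi_hat_{h,theta}(W_i; theta)
        T_n[s] = np.sqrt(n) * A.sum() / (n * (n - 1))          # sqrt(n) * Lambda_hat(theta_s)
    M_n = (T_n ** 2).mean()                                    # Monte Carlo approximation of the integral
    V = rng.standard_normal((B, n))                            # multipliers V_{ib}
    T_star = (2.0 / np.sqrt(n)) * V @ psi_hat.T                # T*_{n,b}(theta_s), shape (B, S)
    M_star = (T_star ** 2).mean(axis=1)                        # M*_{n,b}
    p_value = (M_star >= M_n).mean()                           # simulated asymptotic p-value
    return M_n, p_value
```

Because the multipliers enter only through the inner product with the fixed array `psi_hat`, all $B$ replications reuse the same kernel computations, so the simulation step adds little cost beyond the evaluation of $\hat\Lambda_{n,h}(\theta_s)$ itself.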

5 Monte Carlo Experiments

In this section, we perform some Monte Carlo simulation experiments to examine the finite sample performance of our conditional independence test. For all simulations, we generate IID $\{(X_i, Y_i, Z_i)\}$. We choose $\varphi(\cdot)$ to be the standard normal PDF, and $k(u)$ to be the sixth-order Gaussian kernel ($q = 6$). A large number of replications is used for each experiment and for simulating $M_n^*$.

5.1 Level and Power Studies

We consider three data generating processes. Under DGP 1, the sample $\{(X_i, Y_i, Z_i)\}$ is generated according to

$$Y_i = \delta X_i + Z_i + \varepsilon_{Y,i}, \qquad X_i = Z_i + \varepsilon_{X,i},$$

where $(\varepsilon_{X,i}, \varepsilon_{Y,i})' \sim N(0, I_2)$ is independent of $Z_i$, and $Z_i \sim N(0, 3)$. When $\delta = 0$, the null is true; otherwise the alternative holds.

DGP 2 is a modification of DGP 1 that focuses on the consequences of fat-tailed distributions. Under DGP 2, $\varepsilon_X$ and $\varepsilon_Y$ are proportional to the Student $t$ with 3 degrees of freedom: $\varepsilon_X \sim t_3$, $\varepsilon_Y \sim t_3$, $\varepsilon_X \perp \varepsilon_Y$.

DGP 3 is another modification of DGP 1. Under this DGP, we allow skewness, choosing both $\varepsilon_X$ and $\varepsilon_Y$ to be centered chi-square random variables: $\varepsilon_X$ and $\varepsilon_Y$ are independent $\chi^2$ variates recentered to have mean zero.

We transform each variable so that its range is comparable to the support of the GCR function $\varphi(\cdot)$. For the standard normal PDF, the support is the real line, but the function is effectively zero outside the interval $[-4, 4]$. We transform each variable to be supported on this interval. This can be achieved by using the transformation below:

$$X_i \rightarrow \frac{8}{\pi}\arctan\!\left(\frac{X_i - \bar X}{\hat\sigma_X}\right), \qquad \hat\sigma_X = \left(\frac{1}{n-1}\sum_{j=1}^{n}(X_j - \bar X)^2\right)^{1/2}.$$

We transform $Y_i$ and $Z_i$ analogously. The conditional independence test is then applied to the transformed data. We ignore the boundary bias here, as our suggested bias-removing transformation does not give rise to qualitatively different results.

Although any compact $\Theta$ with a non-empty interior can be used, we take $\Theta = [-1, 1]^4$. This choice ensures that $\{\tilde W_i'\theta\}$ can take any value in the effective support of $\varphi(\cdot)$. To compute the ICM statistic $M_n$, we need to compute the integral $\int_{\Theta}[T_n(\theta)]^2\, d\mu(\theta)$, where $\mu$ is the uniform distribution. In the absence of a closed-form expression, we recommend using the Monte Carlo integration method. For each simulation replication, we choose $\theta_1, \ldots, \theta_S$ randomly from the uniform distribution on $\Theta$ and approximate the integral by the average $S^{-1}\sum_{s=1}^{S}[T_n(\theta_s)]^2$. We have also tried using more random draws, but the results are effectively the same. Note that $T_n(\theta_s)$ depends on the bandwidth parameter $h$. In our simulation experiments, we employ the data-driven bandwidth $\hat h(\theta_s)$ in (27) with $h_0 = n^{-1/[3(q + d_Z)]}$ and $\gamma = 0.5$. We use different bandwidths for different $\theta$'s. Given the bandwidth $\hat h(\theta_s)$, we compute the statistic $T_n(\theta_s)$ as $T_n(\theta_s) = \sqrt{n}\,\hat\Lambda_{n,\hat h(\theta_s)}(\theta_s)$. The average of $[T_n(\theta_s)]^2$ gives us the ICM statistic $M_n$.

We study the finite sample size and power of the test against conditional mean dependence. We use

$$\rho_{X,Y|Z} = \frac{\mathrm{cov}(X, Y \mid Z)}{\sqrt{\sigma^2_{X|Z}\,\sigma^2_{Y|Z}}}$$

to indicate the strength of the dependence between $X$ and $Y$, conditional on $Z$. Since both $X \mid Z$ and $Y \mid Z$ are normal under DGP 1, in this case $\rho_{X,Y|Z}$ fully captures the dependence between $X$ and $Y$, conditional on $Z$. We plot the power of the tests for $\rho_{X,Y|Z}$ ranging from $-0.9$ to $0.9$. For this, we choose

$$\delta = \frac{\rho_{X,Y|Z}}{\sqrt{1 - \rho^2_{X,Y|Z}}}$$

for $\rho_{X,Y|Z} = -0.9, -0.8, \ldots, 0.9$.

Figures 1-3 report the size and power properties of our GCR test. The size and power look fairly good even for the smaller sample sizes we consider, and they look very good for the larger sample sizes. When the sample size is small, the levels of the tests approach their nominal values from below, delivering conservative tests. When the sample size increases, our tests become fairly accurate in size. The shape and location of the power curves are as expected. The power curves are also close to symmetric, reflecting the equal capacity of our test to detect positive and negative conditional dependence.
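For completeness, here is a minimal sketch of DGP 1 and the range transformation as reconstructed above (the exact constants and error specification should be treated as assumptions); setting $\rho = 0$ produces data satisfying the null.

```python
import numpy as np

rng = np.random.default_rng(123)

def dgp1(n, rho):
    """Generate (X, Y, Z) under DGP 1 with conditional correlation rho."""
    delta = rho / np.sqrt(1.0 - rho ** 2)      # delta chosen so that corr(X, Y | Z) = rho
    Z = rng.normal(0.0, np.sqrt(3.0), n)       # assumes N(0, 3) refers to variance 3
    eX, eY = rng.standard_normal(n), rng.standard_normal(n)
    X = Z + eX
    Y = delta * X + Z + eY
    return X, Y, Z

def to_range(v):
    """Map a sample into (-4, 4) via the standardized arctan transformation."""
    s = (v - v.mean()) / v.std(ddof=1)
    return 8.0 / np.pi * np.arctan(s)

X, Y, Z = dgp1(200, rho=0.0)                   # rho = 0: conditional independence holds
Xt, Yt, Zt = (to_range(v) for v in (X, Y, Z))
```

After this transformation, the test statistic and its simulated p-value can be computed from `(Xt, Yt, Zt)` exactly as in the sketch of Section 4.3, with each variable reshaped to an $(n, 1)$ array.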


More information

Chapter 1. Density Estimation

Chapter 1. Density Estimation Capter 1 Density Estimation Let X 1, X,..., X n be observations from a density f X x. Te aim is to use only tis data to obtain an estimate ˆf X x of f X x. Properties of f f X x x, Parametric metods f

More information

THE IMPLICIT FUNCTION THEOREM

THE IMPLICIT FUNCTION THEOREM THE IMPLICIT FUNCTION THEOREM ALEXANDRU ALEMAN 1. Motivation and statement We want to understand a general situation wic occurs in almost any area wic uses matematics. Suppose we are given number of equations

More information

SECTION 3.2: DERIVATIVE FUNCTIONS and DIFFERENTIABILITY

SECTION 3.2: DERIVATIVE FUNCTIONS and DIFFERENTIABILITY (Section 3.2: Derivative Functions and Differentiability) 3.2.1 SECTION 3.2: DERIVATIVE FUNCTIONS and DIFFERENTIABILITY LEARNING OBJECTIVES Know, understand, and apply te Limit Definition of te Derivative

More information

Order of Accuracy. ũ h u Ch p, (1)

Order of Accuracy. ũ h u Ch p, (1) Order of Accuracy 1 Terminology We consider a numerical approximation of an exact value u. Te approximation depends on a small parameter, wic can be for instance te grid size or time step in a numerical

More information

Applications of the van Trees inequality to non-parametric estimation.

Applications of the van Trees inequality to non-parametric estimation. Brno-06, Lecture 2, 16.05.06 D/Stat/Brno-06/2.tex www.mast.queensu.ca/ blevit/ Applications of te van Trees inequality to non-parametric estimation. Regular non-parametric problems. As an example of suc

More information

A = h w (1) Error Analysis Physics 141

A = h w (1) Error Analysis Physics 141 Introduction In all brances of pysical science and engineering one deals constantly wit numbers wic results more or less directly from experimental observations. Experimental observations always ave inaccuracies.

More information

NUMERICAL DIFFERENTIATION. James T. Smith San Francisco State University. In calculus classes, you compute derivatives algebraically: for example,

NUMERICAL DIFFERENTIATION. James T. Smith San Francisco State University. In calculus classes, you compute derivatives algebraically: for example, NUMERICAL DIFFERENTIATION James T Smit San Francisco State University In calculus classes, you compute derivatives algebraically: for example, f( x) = x + x f ( x) = x x Tis tecnique requires your knowing

More information

Pre-Calculus Review Preemptive Strike

Pre-Calculus Review Preemptive Strike Pre-Calculus Review Preemptive Strike Attaced are some notes and one assignment wit tree parts. Tese are due on te day tat we start te pre-calculus review. I strongly suggest reading troug te notes torougly

More information

MA455 Manifolds Solutions 1 May 2008

MA455 Manifolds Solutions 1 May 2008 MA455 Manifolds Solutions 1 May 2008 1. (i) Given real numbers a < b, find a diffeomorpism (a, b) R. Solution: For example first map (a, b) to (0, π/2) and ten map (0, π/2) diffeomorpically to R using

More information

Math 2921, spring, 2004 Notes, Part 3. April 2 version, changes from March 31 version starting on page 27.. Maps and di erential equations

Math 2921, spring, 2004 Notes, Part 3. April 2 version, changes from March 31 version starting on page 27.. Maps and di erential equations Mat 9, spring, 4 Notes, Part 3. April version, canges from Marc 3 version starting on page 7.. Maps and di erential equations Horsesoe maps and di erential equations Tere are two main tecniques for detecting

More information

Bandwidth Selection in Nonparametric Kernel Testing

Bandwidth Selection in Nonparametric Kernel Testing Te University of Adelaide Scool of Economics Researc Paper No. 2009-0 January 2009 Bandwidt Selection in Nonparametric ernel Testing Jiti Gao and Irene Gijbels Bandwidt Selection in Nonparametric ernel

More information

Basic Nonparametric Estimation Spring 2002

Basic Nonparametric Estimation Spring 2002 Basic Nonparametric Estimation Spring 2002 Te following topics are covered today: Basic Nonparametric Regression. Tere are four books tat you can find reference: Silverman986, Wand and Jones995, Hardle990,

More information

EFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS

EFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS Statistica Sinica 24 2014, 395-414 doi:ttp://dx.doi.org/10.5705/ss.2012.064 EFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS Jun Sao 1,2 and Seng Wang 3 1 East Cina Normal University,

More information

Bootstrap prediction intervals for Markov processes

Bootstrap prediction intervals for Markov processes arxiv: arxiv:0000.0000 Bootstrap prediction intervals for Markov processes Li Pan and Dimitris N. Politis Li Pan Department of Matematics University of California San Diego La Jolla, CA 92093-0112, USA

More information

Sin, Cos and All That

Sin, Cos and All That Sin, Cos and All Tat James K. Peterson Department of Biological Sciences and Department of Matematical Sciences Clemson University Marc 9, 2017 Outline Sin, Cos and all tat! A New Power Rule Derivatives

More information

lecture 26: Richardson extrapolation

lecture 26: Richardson extrapolation 43 lecture 26: Ricardson extrapolation 35 Ricardson extrapolation, Romberg integration Trougout numerical analysis, one encounters procedures tat apply some simple approximation (eg, linear interpolation)

More information

4. The slope of the line 2x 7y = 8 is (a) 2/7 (b) 7/2 (c) 2 (d) 2/7 (e) None of these.

4. The slope of the line 2x 7y = 8 is (a) 2/7 (b) 7/2 (c) 2 (d) 2/7 (e) None of these. Mat 11. Test Form N Fall 016 Name. Instructions. Te first eleven problems are wort points eac. Te last six problems are wort 5 points eac. For te last six problems, you must use relevant metods of algebra

More information

Solution. Solution. f (x) = (cos x)2 cos(2x) 2 sin(2x) 2 cos x ( sin x) (cos x) 4. f (π/4) = ( 2/2) ( 2/2) ( 2/2) ( 2/2) 4.

Solution. Solution. f (x) = (cos x)2 cos(2x) 2 sin(2x) 2 cos x ( sin x) (cos x) 4. f (π/4) = ( 2/2) ( 2/2) ( 2/2) ( 2/2) 4. December 09, 20 Calculus PracticeTest s Name: (4 points) Find te absolute extrema of f(x) = x 3 0 on te interval [0, 4] Te derivative of f(x) is f (x) = 3x 2, wic is zero only at x = 0 Tus we only need

More information

3.4 Worksheet: Proof of the Chain Rule NAME

3.4 Worksheet: Proof of the Chain Rule NAME Mat 1170 3.4 Workseet: Proof of te Cain Rule NAME Te Cain Rule So far we are able to differentiate all types of functions. For example: polynomials, rational, root, and trigonometric functions. We are

More information

Average Rate of Change

Average Rate of Change Te Derivative Tis can be tougt of as an attempt to draw a parallel (pysically and metaporically) between a line and a curve, applying te concept of slope to someting tat isn't actually straigt. Te slope

More information

Copyright c 2008 Kevin Long

Copyright c 2008 Kevin Long Lecture 4 Numerical solution of initial value problems Te metods you ve learned so far ave obtained closed-form solutions to initial value problems. A closedform solution is an explicit algebriac formula

More information

New Distribution Theory for the Estimation of Structural Break Point in Mean

New Distribution Theory for the Estimation of Structural Break Point in Mean New Distribution Teory for te Estimation of Structural Break Point in Mean Liang Jiang Singapore Management University Xiaou Wang Te Cinese University of Hong Kong Jun Yu Singapore Management University

More information

Uniform Convergence Rates for Nonparametric Estimation

Uniform Convergence Rates for Nonparametric Estimation Uniform Convergence Rates for Nonparametric Estimation Bruce E. Hansen University of Wisconsin www.ssc.wisc.edu/~bansen October 2004 Preliminary and Incomplete Abstract Tis paper presents a set of rate

More information

SECTION 1.10: DIFFERENCE QUOTIENTS LEARNING OBJECTIVES

SECTION 1.10: DIFFERENCE QUOTIENTS LEARNING OBJECTIVES (Section.0: Difference Quotients).0. SECTION.0: DIFFERENCE QUOTIENTS LEARNING OBJECTIVES Define average rate of cange (and average velocity) algebraically and grapically. Be able to identify, construct,

More information

Math 161 (33) - Final exam

Math 161 (33) - Final exam Name: Id #: Mat 161 (33) - Final exam Fall Quarter 2015 Wednesday December 9, 2015-10:30am to 12:30am Instructions: Prob. Points Score possible 1 25 2 25 3 25 4 25 TOTAL 75 (BEST 3) Read eac problem carefully.

More information

ALGEBRA AND TRIGONOMETRY REVIEW by Dr TEBOU, FIU. A. Fundamental identities Throughout this section, a and b denotes arbitrary real numbers.

ALGEBRA AND TRIGONOMETRY REVIEW by Dr TEBOU, FIU. A. Fundamental identities Throughout this section, a and b denotes arbitrary real numbers. ALGEBRA AND TRIGONOMETRY REVIEW by Dr TEBOU, FIU A. Fundamental identities Trougout tis section, a and b denotes arbitrary real numbers. i) Square of a sum: (a+b) =a +ab+b ii) Square of a difference: (a-b)

More information

Symmetry Labeling of Molecular Energies

Symmetry Labeling of Molecular Energies Capter 7. Symmetry Labeling of Molecular Energies Notes: Most of te material presented in tis capter is taken from Bunker and Jensen 1998, Cap. 6, and Bunker and Jensen 2005, Cap. 7. 7.1 Hamiltonian Symmetry

More information

Gradient Descent etc.

Gradient Descent etc. 1 Gradient Descent etc EE 13: Networked estimation and control Prof Kan) I DERIVATIVE Consider f : R R x fx) Te derivative is defined as d fx) = lim dx fx + ) fx) Te cain rule states tat if d d f gx) )

More information

Poisson Equation in Sobolev Spaces

Poisson Equation in Sobolev Spaces Poisson Equation in Sobolev Spaces OcMountain Dayligt Time. 6, 011 Today we discuss te Poisson equation in Sobolev spaces. It s existence, uniqueness, and regularity. Weak Solution. u = f in, u = g on

More information

REVIEW LAB ANSWER KEY

REVIEW LAB ANSWER KEY REVIEW LAB ANSWER KEY. Witout using SN, find te derivative of eac of te following (you do not need to simplify your answers): a. f x 3x 3 5x x 6 f x 3 3x 5 x 0 b. g x 4 x x x notice te trick ere! x x g

More information

Combining functions: algebraic methods

Combining functions: algebraic methods Combining functions: algebraic metods Functions can be added, subtracted, multiplied, divided, and raised to a power, just like numbers or algebra expressions. If f(x) = x 2 and g(x) = x + 2, clearly f(x)

More information

Click here to see an animation of the derivative

Click here to see an animation of the derivative Differentiation Massoud Malek Derivative Te concept of derivative is at te core of Calculus; It is a very powerful tool for understanding te beavior of matematical functions. It allows us to optimize functions,

More information

Functions of the Complex Variable z

Functions of the Complex Variable z Capter 2 Functions of te Complex Variable z Introduction We wis to examine te notion of a function of z were z is a complex variable. To be sure, a complex variable can be viewed as noting but a pair of

More information

Local Orthogonal Polynomial Expansion (LOrPE) for Density Estimation

Local Orthogonal Polynomial Expansion (LOrPE) for Density Estimation Local Ortogonal Polynomial Expansion (LOrPE) for Density Estimation Alex Trindade Dept. of Matematics & Statistics, Texas Tec University Igor Volobouev, Texas Tec University (Pysics Dept.) D.P. Amali Dassanayake,

More information

Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics 1

Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics 1 Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics 1 By Jiti Gao 2 and Maxwell King 3 Abstract We propose a simultaneous model specification procedure for te conditional

More information

arxiv: v1 [math.pr] 28 Dec 2018

arxiv: v1 [math.pr] 28 Dec 2018 Approximating Sepp s constants for te Slepian process Jack Noonan a, Anatoly Zigljavsky a, a Scool of Matematics, Cardiff University, Cardiff, CF4 4AG, UK arxiv:8.0v [mat.pr] 8 Dec 08 Abstract Slepian

More information

Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity

Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity Songnian Chen a, Xun Lu a, Xianbo Zhou b and Yahong Zhou c a Department of Economics, Hong Kong University

More information

arxiv:math/ v1 [math.ca] 1 Oct 2003

arxiv:math/ v1 [math.ca] 1 Oct 2003 arxiv:mat/0310017v1 [mat.ca] 1 Oct 2003 Cange of Variable for Multi-dimensional Integral 4 Marc 2003 Isidore Fleiscer Abstract Te cange of variable teorem is proved under te sole ypotesis of differentiability

More information

Efficient algorithms for for clone items detection

Efficient algorithms for for clone items detection Efficient algoritms for for clone items detection Raoul Medina, Caroline Noyer, and Olivier Raynaud Raoul Medina, Caroline Noyer and Olivier Raynaud LIMOS - Université Blaise Pascal, Campus universitaire

More information

Kernel Density Estimation

Kernel Density Estimation Kernel Density Estimation Univariate Density Estimation Suppose tat we ave a random sample of data X 1,..., X n from an unknown continuous distribution wit probability density function (pdf) f(x) and cumulative

More information

Preface. Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed.

Preface. Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed. Preface Here are my online notes for my course tat I teac ere at Lamar University. Despite te fact tat tese are my class notes, tey sould be accessible to anyone wanting to learn or needing a refreser

More information

An L p di erentiable non-di erentiable function

An L p di erentiable non-di erentiable function An L di erentiable non-di erentiable function J. Marsall As Abstract. Tere is a a set E of ositive Lebesgue measure and a function nowere di erentiable on E wic is di erentible in te L sense for every

More information

A.P. CALCULUS (AB) Outline Chapter 3 (Derivatives)

A.P. CALCULUS (AB) Outline Chapter 3 (Derivatives) A.P. CALCULUS (AB) Outline Capter 3 (Derivatives) NAME Date Previously in Capter 2 we determined te slope of a tangent line to a curve at a point as te limit of te slopes of secant lines using tat point

More information

New families of estimators and test statistics in log-linear models

New families of estimators and test statistics in log-linear models Journal of Multivariate Analysis 99 008 1590 1609 www.elsevier.com/locate/jmva ew families of estimators and test statistics in log-linear models irian Martín a,, Leandro Pardo b a Department of Statistics

More information

INFINITE ORDER CROSS-VALIDATED LOCAL POLYNOMIAL REGRESSION. 1. Introduction

INFINITE ORDER CROSS-VALIDATED LOCAL POLYNOMIAL REGRESSION. 1. Introduction INFINITE ORDER CROSS-VALIDATED LOCAL POLYNOMIAL REGRESSION PETER G. HALL AND JEFFREY S. RACINE Abstract. Many practical problems require nonparametric estimates of regression functions, and local polynomial

More information

1 The concept of limits (p.217 p.229, p.242 p.249, p.255 p.256) 1.1 Limits Consider the function determined by the formula 3. x since at this point

1 The concept of limits (p.217 p.229, p.242 p.249, p.255 p.256) 1.1 Limits Consider the function determined by the formula 3. x since at this point MA00 Capter 6 Calculus and Basic Linear Algebra I Limits, Continuity and Differentiability Te concept of its (p.7 p.9, p.4 p.49, p.55 p.56). Limits Consider te function determined by te formula f Note

More information

The derivative function

The derivative function Roberto s Notes on Differential Calculus Capter : Definition of derivative Section Te derivative function Wat you need to know already: f is at a point on its grap and ow to compute it. Wat te derivative

More information

2.1 THE DEFINITION OF DERIVATIVE

2.1 THE DEFINITION OF DERIVATIVE 2.1 Te Derivative Contemporary Calculus 2.1 THE DEFINITION OF DERIVATIVE 1 Te grapical idea of a slope of a tangent line is very useful, but for some uses we need a more algebraic definition of te derivative

More information

Artificial Neural Network Model Based Estimation of Finite Population Total

Artificial Neural Network Model Based Estimation of Finite Population Total International Journal of Science and Researc (IJSR), India Online ISSN: 2319-7064 Artificial Neural Network Model Based Estimation of Finite Population Total Robert Kasisi 1, Romanus O. Odiambo 2, Antony

More information

Math 31A Discussion Notes Week 4 October 20 and October 22, 2015

Math 31A Discussion Notes Week 4 October 20 and October 22, 2015 Mat 3A Discussion Notes Week 4 October 20 and October 22, 205 To prepare for te first midterm, we ll spend tis week working eamples resembling te various problems you ve seen so far tis term. In tese notes

More information

5 Ordinary Differential Equations: Finite Difference Methods for Boundary Problems

5 Ordinary Differential Equations: Finite Difference Methods for Boundary Problems 5 Ordinary Differential Equations: Finite Difference Metods for Boundary Problems Read sections 10.1, 10.2, 10.4 Review questions 10.1 10.4, 10.8 10.9, 10.13 5.1 Introduction In te previous capters we

More information

DEPARTMENT MATHEMATIK SCHWERPUNKT MATHEMATISCHE STATISTIK UND STOCHASTISCHE PROZESSE

DEPARTMENT MATHEMATIK SCHWERPUNKT MATHEMATISCHE STATISTIK UND STOCHASTISCHE PROZESSE U N I V E R S I T Ä T H A M B U R G A note on residual-based empirical likeliood kernel density estimation Birte Musal and Natalie Neumeyer Preprint No. 2010-05 May 2010 DEPARTMENT MATHEMATIK SCHWERPUNKT

More information

Continuous Stochastic Processes

Continuous Stochastic Processes Continuous Stocastic Processes Te term stocastic is often applied to penomena tat vary in time, wile te word random is reserved for penomena tat vary in space. Apart from tis distinction, te modelling

More information

Quaternion Dynamics, Part 1 Functions, Derivatives, and Integrals. Gary D. Simpson. rev 01 Aug 08, 2016.

Quaternion Dynamics, Part 1 Functions, Derivatives, and Integrals. Gary D. Simpson. rev 01 Aug 08, 2016. Quaternion Dynamics, Part 1 Functions, Derivatives, and Integrals Gary D. Simpson gsim1887@aol.com rev 1 Aug 8, 216 Summary Definitions are presented for "quaternion functions" of a quaternion. Polynomial

More information

Boosting Kernel Density Estimates: a Bias Reduction. Technique?

Boosting Kernel Density Estimates: a Bias Reduction. Technique? Boosting Kernel Density Estimates: a Bias Reduction Tecnique? Marco Di Marzio Dipartimento di Metodi Quantitativi e Teoria Economica, Università di Cieti-Pescara, Viale Pindaro 42, 65127 Pescara, Italy

More information

ERROR BOUNDS FOR THE METHODS OF GLIMM, GODUNOV AND LEVEQUE BRADLEY J. LUCIER*

ERROR BOUNDS FOR THE METHODS OF GLIMM, GODUNOV AND LEVEQUE BRADLEY J. LUCIER* EO BOUNDS FO THE METHODS OF GLIMM, GODUNOV AND LEVEQUE BADLEY J. LUCIE* Abstract. Te expected error in L ) attimet for Glimm s sceme wen applied to a scalar conservation law is bounded by + 2 ) ) /2 T

More information

A MONTE CARLO ANALYSIS OF THE EFFECTS OF COVARIANCE ON PROPAGATED UNCERTAINTIES

A MONTE CARLO ANALYSIS OF THE EFFECTS OF COVARIANCE ON PROPAGATED UNCERTAINTIES A MONTE CARLO ANALYSIS OF THE EFFECTS OF COVARIANCE ON PROPAGATED UNCERTAINTIES Ronald Ainswort Hart Scientific, American Fork UT, USA ABSTRACT Reports of calibration typically provide total combined uncertainties

More information

. If lim. x 2 x 1. f(x+h) f(x)

. If lim. x 2 x 1. f(x+h) f(x) Review of Differential Calculus Wen te value of one variable y is uniquely determined by te value of anoter variable x, ten te relationsip between x and y is described by a function f tat assigns a value

More information

Kernel Smoothing and Tolerance Intervals for Hierarchical Data

Kernel Smoothing and Tolerance Intervals for Hierarchical Data Clemson University TigerPrints All Dissertations Dissertations 12-2016 Kernel Smooting and Tolerance Intervals for Hierarcical Data Cristoper Wilson Clemson University, cwilso6@clemson.edu Follow tis and

More information

Exam 1 Review Solutions

Exam 1 Review Solutions Exam Review Solutions Please also review te old quizzes, and be sure tat you understand te omework problems. General notes: () Always give an algebraic reason for your answer (graps are not sufficient),

More information

ch (for some fixed positive number c) reaching c

ch (for some fixed positive number c) reaching c GSTF Journal of Matematics Statistics and Operations Researc (JMSOR) Vol. No. September 05 DOI 0.60/s4086-05-000-z Nonlinear Piecewise-defined Difference Equations wit Reciprocal and Cubic Terms Ramadan

More information

Material for Difference Quotient

Material for Difference Quotient Material for Difference Quotient Prepared by Stepanie Quintal, graduate student and Marvin Stick, professor Dept. of Matematical Sciences, UMass Lowell Summer 05 Preface Te following difference quotient

More information

Topics in Generalized Differentiation

Topics in Generalized Differentiation Topics in Generalized Differentiation J. Marsall As Abstract Te course will be built around tree topics: ) Prove te almost everywere equivalence of te L p n-t symmetric quantum derivative and te L p Peano

More information

Fast optimal bandwidth selection for kernel density estimation

Fast optimal bandwidth selection for kernel density estimation Fast optimal bandwidt selection for kernel density estimation Vikas Candrakant Raykar and Ramani Duraiswami Dept of computer science and UMIACS, University of Maryland, CollegePark {vikas,ramani}@csumdedu

More information

1. Questions (a) through (e) refer to the graph of the function f given below. (A) 0 (B) 1 (C) 2 (D) 4 (E) does not exist

1. Questions (a) through (e) refer to the graph of the function f given below. (A) 0 (B) 1 (C) 2 (D) 4 (E) does not exist Mat 1120 Calculus Test 2. October 18, 2001 Your name Te multiple coice problems count 4 points eac. In te multiple coice section, circle te correct coice (or coices). You must sow your work on te oter

More information

Function Composition and Chain Rules

Function Composition and Chain Rules Function Composition and s James K. Peterson Department of Biological Sciences and Department of Matematical Sciences Clemson University Marc 8, 2017 Outline 1 Function Composition and Continuity 2 Function

More information

2.8 The Derivative as a Function

2.8 The Derivative as a Function .8 Te Derivative as a Function Typically, we can find te derivative of a function f at many points of its domain: Definition. Suppose tat f is a function wic is differentiable at every point of an open

More information

Simple Estimators for Semiparametric Multinomial Choice Models

Simple Estimators for Semiparametric Multinomial Choice Models Simple Estimators for Semiparametric Multinomial Choice Models James L. Powell and Paul A. Ruud University of California, Berkeley March 2008 Preliminary and Incomplete Comments Welcome Abstract This paper

More information

LIMITATIONS OF EULER S METHOD FOR NUMERICAL INTEGRATION

LIMITATIONS OF EULER S METHOD FOR NUMERICAL INTEGRATION LIMITATIONS OF EULER S METHOD FOR NUMERICAL INTEGRATION LAURA EVANS.. Introduction Not all differential equations can be explicitly solved for y. Tis can be problematic if we need to know te value of y

More information

Math 102 TEST CHAPTERS 3 & 4 Solutions & Comments Fall 2006

Math 102 TEST CHAPTERS 3 & 4 Solutions & Comments Fall 2006 Mat 102 TEST CHAPTERS 3 & 4 Solutions & Comments Fall 2006 f(x+) f(x) 10 1. For f(x) = x 2 + 2x 5, find ))))))))) and simplify completely. NOTE: **f(x+) is NOT f(x)+! f(x+) f(x) (x+) 2 + 2(x+) 5 ( x 2

More information

Lecture 21. Numerical differentiation. f ( x+h) f ( x) h h

Lecture 21. Numerical differentiation. f ( x+h) f ( x) h h Lecture Numerical differentiation Introduction We can analytically calculate te derivative of any elementary function, so tere migt seem to be no motivation for calculating derivatives numerically. However

More information

Journal of Computational and Applied Mathematics

Journal of Computational and Applied Mathematics Journal of Computational and Applied Matematics 94 (6) 75 96 Contents lists available at ScienceDirect Journal of Computational and Applied Matematics journal omepage: www.elsevier.com/locate/cam Smootness-Increasing

More information

EFFICIENT REPLICATION VARIANCE ESTIMATION FOR TWO-PHASE SAMPLING

EFFICIENT REPLICATION VARIANCE ESTIMATION FOR TWO-PHASE SAMPLING Statistica Sinica 13(2003), 641-653 EFFICIENT REPLICATION VARIANCE ESTIMATION FOR TWO-PHASE SAMPLING J. K. Kim and R. R. Sitter Hankuk University of Foreign Studies and Simon Fraser University Abstract:

More information

LECTURE 12 UNIT ROOT, WEAK CONVERGENCE, FUNCTIONAL CLT

LECTURE 12 UNIT ROOT, WEAK CONVERGENCE, FUNCTIONAL CLT MARCH 29, 26 LECTURE 2 UNIT ROOT, WEAK CONVERGENCE, FUNCTIONAL CLT (Davidson (2), Chapter 4; Phillips Lectures on Unit Roots, Cointegration and Nonstationarity; White (999), Chapter 7) Unit root processes

More information

AMS 147 Computational Methods and Applications Lecture 09 Copyright by Hongyun Wang, UCSC. Exact value. Effect of round-off error.

AMS 147 Computational Methods and Applications Lecture 09 Copyright by Hongyun Wang, UCSC. Exact value. Effect of round-off error. Lecture 09 Copyrigt by Hongyun Wang, UCSC Recap: Te total error in numerical differentiation fl( f ( x + fl( f ( x E T ( = f ( x Numerical result from a computer Exact value = e + f x+ Discretization error

More information

Simple Estimators for Monotone Index Models

Simple Estimators for Monotone Index Models Simple Estimators for Monotone Index Models Hyungtaik Ahn Dongguk University, Hidehiko Ichimura University College London, James L. Powell University of California, Berkeley (powell@econ.berkeley.edu)

More information

THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA EXAMINATION MODULE 5

THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA EXAMINATION MODULE 5 THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA EXAMINATION NEW MODULAR SCHEME introduced from te examinations in 009 MODULE 5 SOLUTIONS FOR SPECIMEN PAPER B THE QUESTIONS ARE CONTAINED IN A SEPARATE FILE

More information