Semiparametric Regression Based on Multiple Sources

Size: px

Start display at page:

Download "Semiparametric Regression Based on Multiple Sources"

Jesse Johns
5 years ago
Views:

1 Semiparametric Regression Based on Multiple Sources Benjamin Kedem Department of Mathematics & Inst. for Systems Research University of Maryland, College Park Give me a place to stand and rest my lever on, and I can move the Earth, (Archimedes, B.C.) AISC, October 5-7, 2012

2 1 Cheng, K.F., and Chu, C.K. (2004), Semiparametric density estimation under a two-sample density ratio model, Bernoulli, 10(4), Fokianos, K. (2004). Merging information for semiparametric density estimation. Journal of the Royal Statistical Society, Series B, 66, Fokianos, K., Kedem, B., Qin, J., Short, D.A. (2001). A semiparametric approach to the one-way layout. Technometrics, 43, McGlynn K.A., Sakoda1 L.C., Rubertone M.V., Sesterhenn I.A., Lyu C., Graubard B.I., Erickson R.L. (2007). Body size, dairy consumption, puberty, and risk of testicular germ cell tumors. American Journal of Epidemiology, 165, Qin, J., Zhang, B. (1997). A goodness of fit test for logistic regression models based on case-control data. Biometrika, 84, Qin, J., and Zhang, B. (2005). Density Estimation Under A Two-Sample Semiparametric Model. Nonparametric Statistics, 1, Kedem, B., Kim, E-Y, Voulgaraki, A., and Graubard, B. (2009). Two-Dimensional Semiparametric Density Ratio Modeling of Testicular Germ Cell Data. Statistics in Medicine, 2009, 28, Voulgaraki, A., Kedem, B., and Graubard, B.(2012). "Semiparametric Regression in Testicular Germ Cell Data", Annals of Applied Statistics, 6, Supplement: DOI: /12-AOAS552SUPP

3 Abstract 1 Information is combined from multiple multivariate sources. 2 Density ratio model: Multivariate distributions are regressed" on a reference distribution. 3 Random cocariates. 4 Kernel density estimator from many data sources: more efficient than the traditional single-sample kernel density estimator. 5 Each multivariate distribution and the corresponding conditional expectation (regression) of interest are estimated from the combined data using all sources. 6 Graphical and quantitative diagnostic tools are suggested to assess model validity. 7 The method is applied in quantifying the effect of height and age on weight of germ cell testicular cancer patients. 8 Comparison with multiple regression, GAM, and nonparametric kernel regression.

4 Main Points Density Ratio Examples Some Simulation Results Summary

5 Main Points Density Ratio Examples Some Simulation Results Summary

6 Main Points Density Ratio Examples Some Simulation Results Summary

7 Main Points Density Ratio Examples Some Simulation Results Summary

8 Main Points Density Ratio Examples Some Simulation Results Summary

9 Main Points Density Ratio Examples Some Simulation Results Summary

10 Outline 1 Density Ratio Examples

11 Multiple filtering of a signal f 1 (ω) = H 1 (ω) 2 f (ω).. (1). f q (ω) = H q (ω) 2 f (ω) That is, q distortions or multiple tilting of the same reference spectral density f.

12 Linear system with Gaussian noise X t = αx t 1 + ɛ t, ɛ = (ɛ 1t,..., ɛ qt, ɛ mt ) ɛ jt g j N(0, σ 2 j ), j = 1,..., q, m. Choose g m g. We get many distortions of the same reference g: g 1 (x) = e α 1+β 1 x 2 g(x)... g q (x) = e αq+βqx2 g(x)

13 Analysis of variance Consider the classical one-way ANOVA with m = q + 1 independent normal random samples: x 11,..., x 1n1 g 1 (x)... x q1,..., x qnq g q (x) x m1,..., x mnm g m (x) g j (x) N( µ j, σ 2 ), j = 1,..., m.

14 Then, holding g m (x) g(x) as a reference: g 1 (x) = exp(α 1 + β 1 x)g(x)... g q (x) = exp(α q + β q x)g(x) α j = µ2 m µ 2 j 2 σ 2, β j = µ j µ m σ 2, j = 1,..., q

15 k-parameter exponential families { k } g(x, θ) = d(θ)s(x) exp c i (θ)t i (x) i=1 Let g j (x) g(x, θ j ), g(x) g(x, θ m ). Again, q distortions of a reference g(x): g 1 (x) = exp{α 1 + β 1 h(x)}g(x)... g q (x) = exp{α q + β qh(x)}g(x)

16 Case-control: Multinomial logistic regression (Prentice and Pyke 1979). RV y s.t. P(y = j) = π j, m j=1 π j = 1. Assume: For j = 1,..., m, and any h(x), P(y = j x) = exp( α j + β jh(x)) 1+ q k=1 exp( α k + β kh(x)) Define: f (x y = j) = g j (x), j = 1,..., m Then with α j = α j + log[ π m /π j ], j = 1,..., q, and g m g, g j (x) = exp( α j + β j h(x))g(x), j = 1,..., q.

17 Weighted Distributions Rao(1965) unified the concept of weighted distributions of the form: W (x; α)p(x; θ) p w (x; θ, α) = E[W (X; α)] where W (x; α) is known. This is an example of tilting of a reference distribution. By changing the weight W (x; α) we obtain different distortions of the same reference p(x; θ).

18 Biased sampling Length-biased Sampling: Vardi (1982) introduced F(y) = 1/µ y 0 xdg(x), y 0, where µ = 0 xdg(x) <. This is a tilt model. Biased Sampling/Selection Bias: Vardi (1985), and Gill, Vardi, Wellner (1988) considered the more general biased sampling model F(y) = W (G) 1 y w(x)dg(x), where w(x) is known and W (G) = w(x)dg(x). This is a tilt model. F, G are obtained by NPMLE.

19 Biased sampling Length-biased Sampling: Vardi (1982) introduced F(y) = 1/µ y 0 xdg(x), y 0, where µ = 0 xdg(x) <. This is a tilt model. Biased Sampling/Selection Bias: Vardi (1985), and Gill, Vardi, Wellner (1988) considered the more general biased sampling model F(y) = W (G) 1 y w(x)dg(x), where w(x) is known and W (G) = w(x)dg(x). This is a tilt model. F, G are obtained by NPMLE.

20 Comparison Distributions (Parzen 1977,...,2009) Sequence of cdf s: {F 1,..., F q } G, with cont. densities f 1,..., f q, g. Comparison Distributions are defined as: D j (u; G, F j ) = F j (G 1 (u)), Then by differentiation, with x = G 1 (u): 0 < u < 1, j = 1,..., q f 1 (x) = d(g(x); G, F 1 )g(x) f q (x) = d(g(x); G, F q )g(x)

21 Outline 1 Density Ratio Examples

22 m = q + 1 independent p-dim multivariate samples: x ij = (x ij1, x ij2,..., x ijp ) g i (x 1,..., x p ), i = 1,..., q, m, j = 1,..., n i and x i1, x i2,..., x ini iid gi Reference or baseline probability density function (pdf): g g m (x) g m (x 1,..., x p )

23 Density ratio model: or equivalently g i (x) g(x) = w(x, θ i) (2) g i (x) = w(x, θ i )g(x) (3) 1. g i (x), g(x) are unknown. 2. w is a known positive and continuous function. 3. θ i are unknown d-dimensional parameters. 4. pdf s may be continuous or discrete. 5. Normal assumption is not made.

24 Problem: Use the combined data from all the samples x 11,..., x 1n1,...x m1,..., x mnm of size n = n n m to estimate all the parameters θ i, i = 1,..., q, and all the densities g 1,..., g q and g m = g.

25 G(x) G m (x), p ij = dg(x ij ) = dg m (x ij ). The empirical log-likelihood is given by: l = log L = m n i q n i log(p ij ) + log(w(x ij, θ i )) (4) i=1 j=1 i=1 j=1 subject to the constraints: p ij 0, m n i p ij = 1, i=1 j=1 m n i p ij w(x ij, θ k ) = 1 for k = 1,..., q. i=1 j=1 (5)

26 ˆp ij = 1 n 1 + q k=1 ˆµ k 1 [ w(x ij, ˆθ k ) 1 ], (6) Ĝ(x) = Ĝm(x) = m n i ˆp ij I(x ij x) (7) i=1 j=1 More generally, for l = 1,..., m and w(x ij, ˆθ m ) 1, Ĝ l (x) = m n i ˆp ij w(x ij, ˆθ l )I(x ij x) (8) i=1 j=1

27 Outline 1 Density Ratio Examples

28 Traditional single sample kernel density estimator: ˆf (x) = 1 nh p n n ( ) x xi K i=1 h n (9) h n is a sequence of bandwidths such that: 1. h n 0, nh p n as n. 2. K (x) is defined for p-dimensional x. 3 K (x) 0, symmetric around R p K (x)dx = 1. Under certain conditions, ˆf (x) is a consistent estimator of f (x) (Parzen 1962, Shao 2003).

29 Fokianos (2004), Cheng and Chu (2004), Qin and Zhang (2005), Voulgaraki, Kedem, Graubard (VKG) (2012), use the the probabilities ˆp ij instead of 1/n: ĝ l (x) = 1 h p n m n i ( ) x xij ˆp ij ŵ l (x ij )K i=1 j=1 h n (10)

30 VKG 2012: Assume that K ( ) is a nonnegative bounded symmetric function with K (x)dx = 1, x xk (x)dx = k 2 > 0. Assume that g l is continuous at x and twice differentiable in a neighborhood of x. If, as n, h n = O(n 1 4+p ), then ( nhn p ĝ l (x) g l (x) 1 2 h2 n u 2 g l (x ) ) D x x uk (u)du N(0, σ 2 (x)) where σ 2 (x) = w l(x)g l (x) m k=1 ζ kw k (x) K 2 (u)du for any fixed x.

31 The mean integrated square error (MISE) is defined as: ( ĝl MISE(ĝ l (x)) = E (x) g l (x) ) 2 dx (11) Minimizing the MISE with respect to h n, the optimal bandwidth is: hn (p/n) w l (x)g l (x)/ [ m k=1 = ζ kw k (x) ] dx K 2 (u)du ( ) 2 u ( 2 g l (x)/ x x )uk (u)du dx 1 4+p (12)

32 For sufficient large n, an alternative way to find h n is by minimizing h p n n n i=1 i =1 ˆp(t i )ŵ l (t i )ˆp(t i )ŵ l (t i ) K (z)k ( z + t i t i h n 2 ( xli x lj n l (n l 1)hn p K h n i j ) dz ). (13)

33 VKG 2012: If ˆf is the classical single-sample multivariate kernel density estimator of g l, then As n, h n 0 and nh p n, (a) AMISE(ĝ l ) AMISE(ˆf ) (b) Using optimal bandwidths, the proposed semiparametric density estimator ĝ l (x) is more efficient than ˆf (x), i.e for every l eff (ˆf, ĝ l ) AMISE (ĝ l ) AMISE (ˆf ) 1 where AMISE is the optimal AMISE.

34 Diagnostic Plots and Goodness of Fit Define ( ) } k Rα,k { 2 = 1 exp xα n i x α (14) n i : Sample size of ith sample. x α: The number of times the estimated semiparametric cdf falls in the estimated 1 α confidence interval obtained from the corresponding empirical cdf, both evaluated at the sample points. k > 0 abd α chosen by the user. R 2 α,k takes values between 0 and 1. Computing R 2 α,k is both simple and fast.

35 Qin and Zhang (1997): R 2 3 = exp( n max G i Ĝi ) (15) Clearly, R 2 3 takes values between 0 and 1.

36 Simulation Results: Ĝi vs. G i, i = 1, 2 Simulation ( 1: g ) 1 N((0, 0), Σ)) (case), g 2 N((0, 0), Σ)) (control), 4 2 Σ =, n = 40, n 2 = 30. Simulation 1: G1 Simulation 1: G2 Joint Sem. CDF case Joint Sem. ref. CDF control Joint Benjamin Empcdf case Kedem AISC: October Joint Empcdf 5-7, control 2012

37 Simulation ( 2: g ) 1 N((0, 0), Σ)) (case), g 2 N((1, 1), Σ)) (control) with 3 1 Σ =, n = 200, n 2 = 200. Simulation 2: G1 Simulation 2: G2 Joint Sem. CDF case Joint Sem. ref. CDF control Joint Empcdf case Joint Empcdf control

38 Simulation 3: g 1 from standard two dimensional Multivariate Cauchy (case) ( and g from two dimensional Multivariate Cauchy (control) with µ = (1, 1), V = 5 10 n 1 = 200, n 2 = 200. ), Simulation 3: G1 Simulation 3: G2 Joint Sem. CDF case Joint Sem. ref. CDF control Joint Empcdf case Joint Empcdf control

39 Simulation 4: g 1 from standard two dimensional Multivariate Cauchy (case) and g 2 from uniform distribution on the triangle (0, 0), (6, 0), ( 3, 4) (control), and n 1 = 200, n 2 = 200. Simulation 4: G1 Simulation 4: G2 Joint Sem. CDF case Joint Sem. ref. CDF control Joint Empcdf case Joint Empcdf control

40 Table: Comparison of R3 2 and R2.05,2 for 100 repetitions of case and control. Simulation Group R3 2 R.05,2 2 (1) Case Control (2) Case Control (3) Case Control (4) Case Control

41 Bandwidth (BW) selection using cross validation. Case Control Same BW Diff. BW s Same BW Diff. BW s h h 1 h 2 h h 1 h 2 Leave One Out Simulation Simulation Alternative (12) Simulation Simulation

42 Outline 1 Density Ratio Examples

43 Suppose we have m multivariate samples of p-dim vectors: (Random Covariates, Response) = (x 1,..., x (p 1), y) We wish to estimate m conditional expectations: E(y x 1,..., x (p 1) )

44 Data: i = 1,..., q, m, j = 1,..., n i, (x ij1, x ij2,..., x ij(p 1), y ij ) g i (x 1,..., x (p 1), y). Reference: Distortions: g g m (x 1,..., x (p 1), y) g i (x 1,..., x (p 1), y), i = 1,..., q

45 Model: As in ratio of pdf s from N(µ 1, Σ), N(µ 2, Σ) assume g i (x) g(x) = w(x, α i, β i ) exp(α i + β ix), i = 1,..., q (16) where x = (x 1,..., x (p 1), y) and β i = (β i1,..., β ip ).

46 Let t denote the vector of combined data of length n = n 1 + n n m. Following the method of constrained empirical likelihood we obtain score equations for ˆα j and ˆβ j : l α j = l β j = n i=1 n i=1 ρ j w j (t i ) 1 + ρ 1 w 1 (t i ) + + ρ q w q (t i ) + n j = 0 (17) ρ j w j (t i )t i 1 + ρ 1 w 1 (t i ) + + ρ q w q (t i ) + n j (x ji1,..., y ji ) (18) = 0 i=1 j = 1,..., q and ρ j = n j /n m w j (t i ) = exp(α j + β j t i)

47 ˆp i = 1 n m Ĝ(t) = 1 n m ρ 1 ŵ 1 (t i ) + + ρ q ŵ q (t i ) n I(t i t) 1 + ρ 1 ŵ 1 (t i ) + + ρ q ŵ q (t i ) i=1 (19) (20) where ŵ j (t i ) = exp(ˆα j + ˆβ jt i ) As n, the estimators ˆθ = (ˆα 1,, ˆα q, ˆβ 1,, ˆβ q ) are asymptotically normal (Qin and Zhang 1997, Lu 2007).

48 Under the p-dimensional density ratio model we can predict the response y given the covariate information x 1, x 2,..., x (p 1) for any of the m data sets as follows: n j ĝ j (x 1,..., x (p 1), y i ) Ê j (y x 1,..., x (p 1) ) = y i, j = 1,..., q, m. y i ĝ j (x 1,..., x (p 1), y i ) i (21) The ĝ j in (21) are the semiparametric kernel density estimates.

49 Assume that the data are bounded. Then: (a) As n, h 0 and nh p, Ê(y x) E(y x) g(x)dx 0 in the mean square sense. (b) If, in addition 0 < A < g(x), then Ê(y x) E(y x) dx 0 in the mean square sense.

50 Outline 1 Density Ratio Examples

51 Testicular germ cell tumor (TGCT) is a common cancer among U.S. men, mainly in the age group of years (McGlynn et al 2003). Increased risk is significantly related to height, whereas body mass index was not found significant (McGlynn et al 2007). Using a two dimensional semiparametric model, it was shown that jointly height and weight are significant risk factors (Kedem et al 2009). We use TGCT data with 3 variables: age, height (cm), weight (kg). Case: n 1 = 763. Control: n 2 = 928. n = 1691 individuals. We consider two cases: 2D with variables height and weight 3D with variables height, weight and age. In both cases the control distribution is the reference distribution.

52 2D Problem: E(Weight Height) Diagnostic plots of Ĝi versus G i, i = 1, 2 evaluated at (height,weight) pairs. Case Group Control Group Joint Sem. CDF Joint Sem. CDF Joint Empirical CDF Joint Empirical CDF

53 2D Case Regression 2D TGCT: E[Weight/Height] for G1 Residual plot for 2D TGCT Case weight Actual points Ref. Regression Semiparametric GAM NW hat(e) height Sem. E[weight/height]

54 2D Control Regression 2D TGCT: E[Weight/Height] for G2 Residual plot for 2D TGCT Control weight Actual points Ref. Regression Semiparametric GAM NW hat(e) height Sem. E[weight/height]

55 3D Problem: E(Weight Height, Age) Diagnostic plots of Ĝi versus G i, i = 1, 2 evaluated at (age,height,weight) triplets. Case Group Control Group Joint Sem. CDF Joint Sem. CDF Joint Empirical CDF Joint Empirical CDF

56 3D Residuals Residual plots for the semiparametric model in the 3D TGCT data set. Residual plot for 3D TGCT Case Residual plot for 3D TGCT Control hat(e) hat(e) Sem. E[weight/height, age] Sem. E[weight/height, age]

57 Some joint probabilities of age, height and weight in the case and control groups. Probability Case Control Pr(A 45, H , W ) Pr(A 26, H , W ) Pr(A 29, H , W ) Pr(A 33, H , W ) Pr(A 34, H , W ) Pr(A 37, H , W ) Pr(A 40, H , W ) Pr(A 43, H , W ) Pr(A 45, H , W )

58 Case-control weight and Ê[weight height, age]. Empty entries in the table correspond to subjects with the same height and age, but possibly different weights. Control Case Age Height Weight Ê[W H, A] Weight Ê[W H, A]

59 Case-control weight and Ê[weight height, age]. Empty entries in the table correspond to subjects with the same height and age, but possibly different weights. Control Case Age Height Weight Ê[W H, A] Weight Ê[W H, A]

60 Case-control weight and Ê[weight height, age]. Empty entries in the table correspond to subjects with the same height and age, but possibly different weights. Control Case Age Height Weight Ê[W H, A] Weight Ê[W H, A]

Semiparametric Regression Based on Multiple Sources

Semiparametric Regression Based on Multiple Sources Benjamin Kedem Department of Mathematics & Inst. for Systems Research University of Maryland, College Park Give me a place to stand and rest my lever