Partial Generalized Additive Models

An Information-theoretic Approach for Selecting Variables and Avoiding Concurvity

Hong Gu, Department of Mathematics and Statistics, Dalhousie University
Mu Zhu, Department of Statistics and Actuarial Science, University of Waterloo

March 16, 2009

Outline
- Introduction
  - Concurvity and the interpretation of GAM
  - An illustrative example
- Sequential maximization of mutual information: GAM and pGAM
- Partial generalized additive models
- Simulations and examples
  - A simulation study
  - Ozone data
  - Air pollution and mortality data
- Summary and discussions

Generalized additive models (GAM)

Response variable: $Y$; predictor variables: $X = (X_1, \ldots, X_p)$.

GAM: $E(Y \mid X) = h(\eta(X)) = h(f_0 + f_1(X_1) + \cdots + f_p(X_p))$, where the response $Y$ follows an exponential-family distribution and $h$ is a known monotonic link function.

GAM is popular because of:
- its simple form and the intuitive interpretation of each individual predictor's effect on the response;
- its predictive accuracy.

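For concreteness, a minimal sketch of fitting such a model in R with mgcv (the package discussed below); the data and formula here are purely illustrative:

```r
# Minimal GAM fit with mgcv (illustrative data, not from the talk).
library(mgcv)

set.seed(1)
n  <- 200
x1 <- runif(n); x2 <- runif(n)
y  <- 5 * exp(x1) + sin(2 * pi * x2) + rnorm(n, sd = 0.3)

fit <- gam(y ~ s(x1) + s(x2), family = gaussian)  # E(Y|X) = h(f0 + f1(X1) + f2(X2))
summary(fit)
plot(fit, pages = 1)  # one panel per fitted smooth f_j
```
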
Concurvity and the interpretation of GAM

However, the interpretation is not straightforward: the contributions of the different variables are generally not independent.

Concurvity: strong functional relationships among the predictor variables (Hastie and Tibshirani, 1990; Donnell, Buja and Stuetzle, 1994); it is the analogue of collinearity.
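
As an aside not on the original slide: versions of mgcv more recent than this 2009 talk ship a concurvity() diagnostic that quantifies this; a minimal sketch, with x2 constructed to be a near-function of x1:

```r
# Concurvity diagnostic in mgcv (the function postdates this talk).
library(mgcv)

set.seed(2)
x1 <- runif(300)
x2 <- x1^2 + rnorm(300, sd = 0.03)    # x2 is almost a function of x1
y  <- sin(2 * pi * x1) + rnorm(300, sd = 0.2)

b <- gam(y ~ s(x1) + s(x2))
concurvity(b, full = TRUE)            # indices near 1 flag severe concurvity
```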

The seminal contributions of Simon Wood

Concurvity is commonly dealt with by controlling the complexity or smoothness of each fitted function, i.e., by shrinkage methods.
- Wood (2000): a general methodology for efficiently selecting multiple smoothing parameters.
- Wood (2004): solved a difficult numerical rank-deficiency problem, and showed that his methods provide much more stable functional reconstruction and very competitive MSE.
- Wood (2006): gam in the mgcv package, the current state of the art of GAM fitting.

But what about model interpretation when concurvity structures exist? And model simplification and variable selection in GAM?

An illustrative example

$X_1, X_2, X_3, X_4 \stackrel{iid}{\sim} U(0,1)$, $X_5 = 2X_1^3 + N(0, \sigma_1^2)$
$Y = (5e^{X_1} + 2X_1^3) + X_3 + N(0, \sigma_2^2)$

What is the effect of $X_1$ on $Y$?

[Figure: the function $f(x) = 5e^x + 2x^3$ on $[0, 1]$.]
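
A sketch of this data-generating process in R (the noise standard deviations did not survive transcription of this slide; 0.01 and 0.5 match the strong-concurvity, medium-SNR setting of the simulation study later):

```r
set.seed(3)
n  <- 300
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n); x4 <- runif(n)
x5 <- 2 * x1^3 + rnorm(n, sd = 0.01)              # nearly a function of x1
y  <- (5 * exp(x1) + 2 * x1^3) + x3 + rnorm(n, sd = 0.5)

library(mgcv)
full <- gam(y ~ s(x1) + s(x2) + s(x3) + s(x4) + s(x5))
plot(full, pages = 1)   # compare s(x1) with the true curve 5*exp(x) + 2*x^3
```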

Illustrative example: effects estimated by GAM

[Figure: estimated smooth effects s(x1, edf = 3.51), s(x2, edf = 1), s(x3, edf = 1.48), s(x4, edf = 1), and s(x5, edf = 1) from the full GAM.]

Illustrative example: effects estimated by pGAM

Only $X^{(1)}$ and $X^{(3)}$ are included in the final model by pGAM.

[Figure: estimated smooth effects s(x1, edf = 4) and s(x3, edf = 2).]

Mutual information (MI) and its properties

MI provides a good measure of the strength of statistical dependence between random variables. MI is defined as
$$MI_{XY} = E\left[\log \frac{f(X, Y)}{f_X(X)\, f_Y(Y)}\right].$$
The following properties make MI a nonlinear analogue of the linear correlation $\rho$ (Brillinger, 2004):
1. $I_{XY} = 0$ iff $X$ is independent of $Y$.
2. In the continuous case, $I_{XY} = \infty$ if $Y = g(X)$.
3. Invariance: $I_{XY} = I_{UV}$ if $u = u(x)$ and $v = v(y)$ are individually one-to-one measurable transformations.
4. For the bivariate normal, $I_{XY} = -\frac{1}{2}\log(1 - \rho_{XY}^2)$.

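A quick numerical check of property (4), not from the talk: estimate MI for simulated bivariate normal data with a crude two-dimensional histogram (plug-in) estimator and compare with $-\frac{1}{2}\log(1-\rho^2)$:

```r
# Histogram plug-in estimate of MI vs. the bivariate-normal closed form.
set.seed(4)
n   <- 1e5
rho <- 0.8
x <- rnorm(n)
y <- rho * x + sqrt(1 - rho^2) * rnorm(n)

nb  <- 50
pxy <- table(cut(x, nb), cut(y, nb)) / n          # joint cell probabilities
px  <- rowSums(pxy); py <- colSums(pxy)           # marginals
mi_hat <- sum(pxy * log(pxy / outer(px, py)), na.rm = TRUE)

c(estimate = mi_hat, theory = -0.5 * log(1 - rho^2))  # theory: about 0.511;
                                                      # the crude estimate is
                                                      # close but biased
```
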
MI and GAM

MI is the amount of information in $X$ that can be used to reduce the uncertainty about $Y$, or equivalently the amount of information in $Y$ that can be used to reduce the uncertainty about $X$:
$$MI(Y; X_1, \ldots, X_p) = H(Y) - H(Y \mid X) = H(X) - H(X \mid Y).$$
Function approximation amounts to finding the $\eta(X)$ that maximizes the MI between $Y$ and $\eta(X) = f(X_1, \ldots, X_p)$.

GAM uses a first-order (or low-order) ANOVA-like decomposition of $E(Y \mid X_1, \ldots, X_p) = f(x_1, \ldots, x_p)$ to cope with the curse of dimensionality.

Note: the maximum value of $MI(Y; \eta(X))$ is invariant to the choice of the link function.

MI and GAM

Generally, for any $\eta(X)$, $MI(Y; X_1, \ldots, X_p) \ge MI(Y; \eta(X))$; if $Y \perp X$ given $\eta(X)$, then $MI(Y; X_1, \ldots, X_p) = MI(Y; \eta(X))$.

The chain rule for MI:
$$MI(Y; X_1, \ldots, X_p) = MI(Y; X_1) + MI(Y; X_2 \mid X_1) + \cdots + MI(Y; X_p \mid X_{p-1}, \ldots, X_1).$$
Finding $f_1(X_1)$ to approach $MI(Y; X_1)$: $\max_\eta E(\ell(Y \mid \eta(X_1)))$, i.e., finding $E(Y \mid X_1) = f_1(X_1)$.

Write $Y = f_1(X_1) + Z_Y^1$, where $Z_Y^1 \perp X_1$; then $MI(Y; X_2 \mid X_1) = MI(Z_Y^1; X_2)$.

This leads to the familiar back-fitting algorithm.

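A bare-bones back-fitting sketch in R for the Gaussian case with two smooths (an illustration of the idea, using smooth.spline as the smoother):

```r
# Back-fitting for y = a + f1(x1) + f2(x2) + noise (Gaussian sketch).
backfit <- function(y, x1, x2, iters = 20) {
  a  <- mean(y)
  f1 <- f2 <- rep(0, length(y))
  for (it in 1:iters) {
    # smooth the partial residuals for each term in turn, then center
    f1 <- predict(smooth.spline(x1, y - a - f2), x1)$y
    f1 <- f1 - mean(f1)
    f2 <- predict(smooth.spline(x2, y - a - f1), x2)$y
    f2 <- f2 - mean(f2)
  }
  list(alpha = a, f1 = f1, f2 = f2)
}
```

Each pass updates one $f_j$ by smoothing the residuals of everything else, which is the sequential extraction of conditional MI terms described above.
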
MI and pGAM

An alternative way to fit the second term, $MI(Y; X_2 \mid X_1)$, leads us to a new procedure. Suppose $X_2 = g_{21}(X_1) + X^{(2)}$, where $X_1 \perp X^{(2)}$. Then
$$MI(Y; X_2 \mid X_1) = H(X_2 \mid X_1) - H(X_2 \mid Y, X_1) = MI(Y; X^{(2)}).$$
So: first, estimate $g_{21}$ by smoothing $X_2$ onto $X_1$; then fit a (univariate) GAM of $Y$ onto $X^{(2)} \equiv X_2 - g_{21}(X_1)$. Note that $X^{(2)} \perp X_1$.

This provides a natural way to avoid concurvity and constitutes the main idea of our procedure, pGAM.

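The corresponding computation in R, continuing the illustrative example above (a sketch; the smoothing of one covariate onto another is itself just a univariate GAM):

```r
# pGAM's key step (sketch): replace x5 by the part not explained by x1.
library(mgcv)
g51    <- gam(x5 ~ s(x1))          # estimate g_51 by smoothing x5 onto x1
x5_par <- x5 - fitted(g51)         # partial covariate X^(5)
cor(x1, x5_par)                    # near 0: the concurvity has been removed
m <- gam(y ~ s(x1) + s(x5_par))    # x5 can now enter only via its residual part
```
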
An explicit variable selection procedure

- At each step, enter the variable whose MI with $Y$ is largest.
- Stop when the MI between $Y$ and every remaining input variable becomes fairly small.
- Covariates deemed important a priori are always included in the initial model.
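
The stopping decision can be based on the deviance improvement; a sketch using mgcv's anova method:

```r
# Entry test (sketch): F-test for Gaussian responses; use test = "Chisq"
# for binomial or Poisson responses.
m0 <- gam(y ~ s(x1))               # current model
m1 <- gam(y ~ s(x1) + s(x3))       # current model plus the best candidate
anova(m0, m1, test = "F")          # insignificant improvement => stop at m0
```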

Indirect estimation of MI

Direct estimation of MI is not a trivial problem:
$$MI(X; Y) = H(X) + H(Y) - H(X, Y).$$
Instead, we work with a proxy of $MI(Y; X)$, namely $\max_\eta MI(Y; \eta(X))$. If $\eta(X)$ is sufficient for $Y$, then $MI(Y; \eta(X)) = MI(Y; X)$, and
$$MI(Y; \eta(X)) = E(\ell(Y \mid \eta(X))) - E \log f_Y(Y).$$
Thus only the conditional log-likelihood is needed to get the right ordering of the covariates at each step.

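Because $E\log f_Y(Y)$ is common to all candidates, ranking covariates by the MI proxy reduces to ranking univariate fits by log-likelihood, i.e., by deviance; a sketch:

```r
# Rank working covariates by the deviance of univariate GAM fits;
# the smallest deviance marks the largest (proxy) MI with y.
library(mgcv)
Xw   <- list(x1 = x1, x2 = x2, x3 = x3, x4 = x4, x5 = x5)
devs <- sapply(Xw, function(x) deviance(gam(y ~ s(x))))
sort(devs)
```
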
The pGAM algorithm: initialization

1. Start with a null model $m$ by fitting a GAM of $Y$ onto a constant; let $D_0$ be the deviance of $m$.
2. Center all $X_j$'s to have mean zero; let $X_w = \{X^{(j)} = X_j;\ j = 1, 2, \ldots, p\}$ be the set of working variables.
3. Set $t = 1$.

The pGAM algorithm: while $t \le p$

1. Fit a (univariate) GAM of $Y$ onto every working variable in $X_w$ and record the deviance of each resulting GAM.
2. Suppose $X^{(i)} \in X_w$ is the variable whose corresponding (univariate) GAM has the largest log-likelihood, or equivalently the smallest deviance. Add $X^{(i)}$ to $m$; record the resulting deviance, $D_{new}$; and let $X_w \leftarrow X_w \setminus \{X^{(i)}\}$.
3. Test whether $D_{new}$ is a significant improvement over $D_0$, e.g., with an F-test for Gaussian or a $\chi^2$-test for binomial or Poisson responses. If insignificant, remove $X^{(i)}$ from $m$ and output $m$.
4. For every $X^{(j)} \in X_w$ ($j \ne i$), fit the model $X^{(j)} = g_{ji}(X^{(i)}) + \epsilon_j$ by smoothing $X^{(j)}$ onto $X^{(i)}$; record the fitted function $g_{ji}$.
5. Let $t \leftarrow t + 1$; $D_0 \leftarrow D_{new}$; and $X^{(j)} \leftarrow X^{(j)} - g_{ji}(X^{(i)})$.

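Putting the steps together, a compact Gaussian-case sketch of this loop (the names and implementation details here are ours, not the authors' published code):

```r
# pGAM forward loop (sketch, Gaussian case): y is the response and
# X a data frame of covariates; returns the model and the g_ji fits.
library(mgcv)

pgam_sketch <- function(y, X, alpha = 0.05) {
  Xw <- as.data.frame(scale(X, scale = FALSE))   # center working variables
  m  <- gam(y ~ 1)                               # null model
  sel <- list(); gfits <- list()
  while (ncol(Xw) > 0) {
    # univariate GAM of y on each working variable; pick the best
    devs <- sapply(Xw, function(x) deviance(gam(y ~ s(x))))
    i <- names(which.min(devs))
    sel[[i]] <- Xw[[i]]
    dat  <- data.frame(y = y, sel)
    mnew <- gam(reformulate(sprintf("s(%s)", names(sel)), "y"), data = dat)
    tst  <- anova(m, mnew, test = "F")           # deviance improvement test
    p <- tst[["Pr(>F)"]][2]
    if (is.na(p) || p > alpha) { sel[[i]] <- NULL; break }
    m <- mnew
    xi <- Xw[[i]]; Xw[[i]] <- NULL
    for (j in names(Xw)) {                       # residualize the rest on X^(i)
      gj <- gam(Xw[[j]] ~ s(xi))
      gfits[[paste(j, i, sep = ".")]] <- gj
      Xw[[j]] <- Xw[[j]] - fitted(gj)
    }
  }
  list(model = m, partial_fits = gfits)
}
```
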
The pGAM algorithm: output

Output: the model $m$ and the fitted functions $g_{ji}$.

Simulation study: variable selection

$X_1, X_2, X_3, X_4 \stackrel{iid}{\sim} U(0,1)$, $X_5 = 2X_1^3 + N(0, \sigma_1^2)$
$Y = (5e^{X_1} + 2X_1^3) + X_3 + N(0, \sigma_2^2)$

[Table: number of times each variable combination, (1,3), (1,3,5), (1,3,*), (1), (5), (5,3), or other, was selected by pGAM out of 500 simulations, for each concurvity level (strong: σ1 = 0.01; medium: σ1 = 0.50; weak: σ1 = 0.90) and SNR (H: σ2 = 0.1; M: σ2 = 0.5; L: σ2 = 1.0). A star (*) means a variable other than X1, X3, or X5. The cell counts did not survive transcription.]

Simulation study: prediction on test sets

RMSE and RPSE. DIFF-RMSE = RMSE(GAM) minus RMSE(pGAM); positive differences indicate that pGAM is (slightly) better. The means did not survive transcription; the standard deviations did:

Concurvity          SNR            DIFF-RMSE mean (stdev)   DIFF-RPSE mean (stdev)
Strong (σ1 = 0.01)  H (σ2 = 0.1)   ... (0.0038)             ... (0.0010)
                    M (σ2 = 0.5)   ... (0.0220)             ... (0.0057)
                    L (σ2 = 1.0)   ... (0.0800)             ... (0.0183)
Medium (σ1 = 0.50)  H (σ2 = 0.1)   ... (0.0037)             ... (0.0009)
                    M (σ2 = 0.5)   ... (0.0200)             ... (0.0045)
                    L (σ2 = 1.0)   ... (0.0421)             ... (0.0091)
Weak (σ1 = 0.90)    H (σ2 = 0.1)   ... (0.0036)             ... (0.0009)
                    M (σ2 = 0.5)   ... (0.0188)             ... (0.0041)
                    L (σ2 = 1.0)   ... (0.0416)             ... (0.0088)

Simulation study: estimated functional effects by pGAM

Strong concurvity ($\sigma_1 = 0.01$) and medium SNR ($\sigma_2 = 0.5$). Pointwise mean and 90% CI, based on the runs in which pGAM chose the right variable combination (489 out of 500 simulations).

[Figure: pointwise mean and 90% CI for s(X1) and s(X3) estimated by pGAM.]

Simulation study: estimated functional effects by GAM

Strong concurvity ($\sigma_1 = 0.01$) and medium SNR ($\sigma_2 = 0.5$). Pointwise mean together with the 5th and 95th percentiles (500 simulations).

[Figure: pointwise mean and 90% CI for s(X1) through s(X5) estimated by GAM.]

Ozone data: variables selected by pGAM

Ozone: Gaussian response. The number in parentheses indicates the order in which pGAM selected the variable; unnumbered variables were not selected.

Table: Variables in the ozone data set.
  ozone          logarithm of ozone concentration (log-ppm)
  temp (1)       Sandburg Air Force Base temperature
  ibh (2)        inversion base height
  dpg (6)        Daggert pressure gradient
  vis (5)        visibility in miles
  vh             Vandenburg 500 millibar pressure height
  humidity (3)   humidity (%)
  ibt            inversion base temperature
  wind           wind speed (mph)
  doy (4)        day of the year

GAM, pGAM, and GAM with the same df as pGAM (1)

[Figure: estimated effects of temp, ibh, and humidity under the three fits; edf 3.79, 2.75, and 2.38 for GAM versus 4, 5, and 5 for pGAM and the matched-df GAM.]

GAM, pGAM, and GAM with the same df as pGAM (2)

[Figure: estimated effects of doy, vis, and dpg under the three fits; edf 4.55, 5.51, and 3.3 for GAM versus 4, 7, and 3 for pGAM and the matched-df GAM.]

Ozone data: differences in the covariate effects estimated by GAM and pGAM

- The effect of temp is much closer to a simple linear effect in pGAM.
- For GAM, humidity is not a significant covariate (p-value = 0.06 for the default GAM and p-value = 0.13 for the GAM using only six variables). For pGAM, after removing the partial effects of temp and ibh, the first two variables selected, humidity becomes a significant covariate (p-value = ...). Visually, the effect of humidity is much less flat in pGAM.
- The effects of doy and dpg estimated by pGAM peak at different locations than those estimated by GAM.

Comparison with the analysis of Donnell et al. (1994)

Concurvities found by Donnell et al. (1994):
1. Given ibh, there is a positive relationship between temp and ibt; given temp, there is a negative relationship between ibh and ibt.
2. The covariates temp and vh tend to increase together.
3. There is a strong and complex (nonlinear) relationship involving ozone, temp, dpg, and doy.

pGAM has successfully detected and removed the first two concurvities. The third concurvity involves ozone itself; this suggests that temp, dpg, and doy are all important covariates, and the pGAM model includes all of them.

Comparison with Breiman and Friedman (1985): alternating conditional expectations (ACE)

- ACE includes temp, ibh, dpg, vis, and doy; pGAM includes all of these plus humidity.
- Using doy as the single input, its effect peaks in late July and early August, consistent with the highest pollution days occurring from July to September.
- In the ACE model, the peak shifted to the beginning of May. "This was puzzling to [them], since the highest pollution days occur from July to September." The ACE paper's interpretation: doy may serve as "a partial surrogate for hours of daylight before and during the morning commuter rush."
- Under pGAM, the peak effect of doy occurs in late July; this suggests the shift in ACE was due to subtle concurvities.

Partial effects among covariates estimated by pGAM (1)

[Figure: fitted partial effects of temp on ibh, humidity, doy, vis, dpg, vh, wind, and ibt, and of ibh on humidity, doy, vis, dpg, vh, wind, and ibt.]

Partial effects among covariates estimated by pGAM (2)

[Figure: fitted partial effects of humidity on doy, vis, dpg, vh, wind, and ibt; of doy on vis, dpg, vh, wind, and ibt; of vis on dpg, vh, wind, and ibt; and of dpg on vh, wind, and ibt.]

Air pollution and mortality data: Philadelphia, 1995 to 2000

The National Mortality, Morbidity, and Air Pollution Study (NMMAPS): daily mortality, air pollution, and weather data (http//...).

Model: $\log(\lambda_t) = f(t) + \sum_{j=1}^{d} g_j(x_{jt}) + h(z_t)$, with mortality a Poisson response.

Table: Variables in the Philadelphia air pollution data set.
  y_t    mortality   number of non-accidental deaths in the 65 to 75 age group
  t      time        measured in days, i.e., 1, 2, ..., 2191
  z_t    pollutant   daily NO2 concentration
  x_1t   temp        average daily temperature
  x_2t   dptp        daily dewpoint temperature
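
A sketch of this model in mgcv (the column names and the data frame phila are assumptions; basis dimensions mirror the df shown on the next slide):

```r
# Poisson GAM for the daily mortality counts (sketch; data frame assumed).
library(mgcv)
fit <- gam(mortality ~ s(time, k = 21) + s(temp, k = 4) +
                       s(dptp, k = 4) + s(pollutant, k = 3),
           family = poisson, data = phila)  # k set to target df + 1, since
summary(fit)                                # one df is absorbed by centering
```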

Philadelphia air pollution and mortality data: approximate significance of smooth terms

Model: mortality ~ s(time, df = 20) + s(temp, df = 3) + s(dptp, df = 3) + s(pollutant, df = 2).

[Table: p-values of each term under GAM and under pGAM, for $g_1(x_{1t})$ = s(temp, df = 3), $g_2(x_{2t})$ = s(dptp, df = 3), $h(z_t)$ = s(pollutant, df = 2), and $f(t)$ = s(time, df = 20); the p-values did not survive transcription.]

Philadelphia air pollution and mortality data: effects of covariates estimated by GAM

[Figure: estimated smooths s(temp, 3), s(dptp, 3), s(pollutant, 2), and s(time, 20).]

Philadelphia air pollution and mortality data: effects of covariates estimated by pGAM

Only two covariates, $t$ (time) and $z_t$ (pollutant), are selected.

[Figure: estimated smooths s(time, 20) and s(pollutant, 2).]

Philadelphia air pollution and mortality data: partial effects estimated by pGAM

[Figure: fitted partial effects of time on temp, dptp, and pollutant, and of pollutant on temp and dptp.]

Philadelphia air pollution and mortality data: summary

Overall, our analysis suggests: (i) that mortality among Philadelphia residents between the ages of 65 and 75 was decreasing during the period 1995 to 2000; (ii) that mortality for this population was highest in winter and lowest in summer; and (iii) that, after adjusting for the strong seasonal effect, air pollution in the form of nitrogen dioxide still appeared to significantly increase mortality for this population.

Summary and discussions

- The back-fitting algorithm for GAM can be viewed conceptually as a sequential method for maximizing MI.
- pGAM gives better estimates of the covariates' functional effects when concurvity structures exist.
- A useful observation about estimating MI: first maximize the conditional log-likelihood of $Y$ given $X$, then estimate the entropy of $Y$ alone.
- For research on fitting GAMs with unknown link functions (e.g., Horowitz, 2001; Cadarso-Suarez et al., 2005), the invariance property of MI makes it clear that the link function can be chosen freely to facilitate model interpretation without affecting the goodness of fit. If the function $\eta$ is not fully flexible, however, the choice of link function does make a difference, as in GLM.


Regression Analysis. Regression: Methodology for studying the relationship among two or more variables Regression Analysis Regression: Methodology for studying the relationship among two or more variables Two major aims: Determine an appropriate model for the relationship between the variables Predict the

More information

The Poisson transform for unnormalised statistical models. Nicolas Chopin (ENSAE) joint work with Simon Barthelmé (CNRS, Gipsa-LAB)

The Poisson transform for unnormalised statistical models. Nicolas Chopin (ENSAE) joint work with Simon Barthelmé (CNRS, Gipsa-LAB) The Poisson transform for unnormalised statistical models Nicolas Chopin (ENSAE) joint work with Simon Barthelmé (CNRS, Gipsa-LAB) Part I Unnormalised statistical models Unnormalised statistical models

More information

Conditional distributions. Conditional expectation and conditional variance with respect to a variable.

Conditional distributions. Conditional expectation and conditional variance with respect to a variable. Conditional distributions Conditional expectation and conditional variance with respect to a variable Probability Theory and Stochastic Processes, summer semester 07/08 80408 Conditional distributions

More information

EE376A: Homeworks #4 Solutions Due on Thursday, February 22, 2018 Please submit on Gradescope. Start every question on a new page.

EE376A: Homeworks #4 Solutions Due on Thursday, February 22, 2018 Please submit on Gradescope. Start every question on a new page. EE376A: Homeworks #4 Solutions Due on Thursday, February 22, 28 Please submit on Gradescope. Start every question on a new page.. Maximum Differential Entropy (a) Show that among all distributions supported

More information

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form:

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form: Outline for today What is a generalized linear model Linear predictors and link functions Example: fit a constant (the proportion) Analysis of deviance table Example: fit dose-response data using logistic

More information

Spacetime models in R-INLA. Elias T. Krainski

Spacetime models in R-INLA. Elias T. Krainski Spacetime models in R-INLA Elias T. Krainski 2 Outline Separable space-time models Infant mortality in Paraná PM-10 concentration in Piemonte, Italy 3 Multivariate dynamic regression model y t : n observations

More information

STAT5044: Regression and Anova

STAT5044: Regression and Anova STAT5044: Regression and Anova Inyoung Kim 1 / 18 Outline 1 Logistic regression for Binary data 2 Poisson regression for Count data 2 / 18 GLM Let Y denote a binary response variable. Each observation

More information

MAS223 Statistical Inference and Modelling Exercises

MAS223 Statistical Inference and Modelling Exercises MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html 1 / 42 Passenger car mileage Consider the carmpg dataset taken from

More information

Introduction Outline Introduction Copula Specification Heuristic Example Simple Example The Copula perspective Mutual Information as Copula dependent

Introduction Outline Introduction Copula Specification Heuristic Example Simple Example The Copula perspective Mutual Information as Copula dependent Copula Based Independent Component Analysis SAMSI 2008 Abayomi, Kobi + + SAMSI 2008 April 2008 Introduction Outline Introduction Copula Specification Heuristic Example Simple Example The Copula perspective

More information

MIT Spring 2016

MIT Spring 2016 Generalized Linear Models MIT 18.655 Dr. Kempthorne Spring 2016 1 Outline Generalized Linear Models 1 Generalized Linear Models 2 Generalized Linear Model Data: (y i, x i ), i = 1,..., n where y i : response

More information

Lecture 1: August 28

Lecture 1: August 28 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 1: August 28 Our broad goal for the first few lectures is to try to understand the behaviour of sums of independent random

More information

Regularization Paths. Theme

Regularization Paths. Theme June 00 Trevor Hastie, Stanford Statistics June 00 Trevor Hastie, Stanford Statistics Theme Regularization Paths Trevor Hastie Stanford University drawing on collaborations with Brad Efron, Mee-Young Park,

More information

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression

More information

Semi-Parametric Importance Sampling for Rare-event probability Estimation

Semi-Parametric Importance Sampling for Rare-event probability Estimation Semi-Parametric Importance Sampling for Rare-event probability Estimation Z. I. Botev and P. L Ecuyer IMACS Seminar 2011 Borovets, Bulgaria Semi-Parametric Importance Sampling for Rare-event probability

More information

Rank-Based Methods. Lukas Meier

Rank-Based Methods. Lukas Meier Rank-Based Methods Lukas Meier 20.01.2014 Introduction Up to now we basically always used a parametric family, like the normal distribution N (µ, σ 2 ) for modeling random data. Based on observed data

More information

of the 7 stations. In case the number of daily ozone maxima in a month is less than 15, the corresponding monthly mean was not computed, being treated

of the 7 stations. In case the number of daily ozone maxima in a month is less than 15, the corresponding monthly mean was not computed, being treated Spatial Trends and Spatial Extremes in South Korean Ozone Seokhoon Yun University of Suwon, Department of Applied Statistics Suwon, Kyonggi-do 445-74 South Korea syun@mail.suwon.ac.kr Richard L. Smith

More information

Master s Written Examination

Master s Written Examination Master s Written Examination Option: Statistics and Probability Spring 05 Full points may be obtained for correct answers to eight questions Each numbered question (which may have several parts) is worth

More information

Independent Component Analysis

Independent Component Analysis 1 Independent Component Analysis Background paper: http://www-stat.stanford.edu/ hastie/papers/ica.pdf 2 ICA Problem X = AS where X is a random p-vector representing multivariate input measurements. S

More information

Chapter 11 Lecture Outline. Heating the Atmosphere

Chapter 11 Lecture Outline. Heating the Atmosphere Chapter 11 Lecture Outline Heating the Atmosphere They are still here! Focus on the Atmosphere Weather Occurs over a short period of time Constantly changing Climate Averaged over a long period of time

More information

Creating Non-Gaussian Processes from Gaussian Processes by the Log-Sum-Exp Approach. Radford M. Neal, 28 February 2005

Creating Non-Gaussian Processes from Gaussian Processes by the Log-Sum-Exp Approach. Radford M. Neal, 28 February 2005 Creating Non-Gaussian Processes from Gaussian Processes by the Log-Sum-Exp Approach Radford M. Neal, 28 February 2005 A Very Brief Review of Gaussian Processes A Gaussian process is a distribution over

More information

Random Variables. P(x) = P[X(e)] = P(e). (1)

Random Variables. P(x) = P[X(e)] = P(e). (1) Random Variables Random variable (discrete or continuous) is used to derive the output statistical properties of a system whose input is a random variable or random in nature. Definition Consider an experiment

More information

ST3241 Categorical Data Analysis I Generalized Linear Models. Introduction and Some Examples

ST3241 Categorical Data Analysis I Generalized Linear Models. Introduction and Some Examples ST3241 Categorical Data Analysis I Generalized Linear Models Introduction and Some Examples 1 Introduction We have discussed methods for analyzing associations in two-way and three-way tables. Now we will

More information

Chapter 4 Multiple Random Variables

Chapter 4 Multiple Random Variables Review for the previous lecture Theorems and Examples: How to obtain the pmf (pdf) of U = g ( X Y 1 ) and V = g ( X Y) Chapter 4 Multiple Random Variables Chapter 43 Bivariate Transformations Continuous

More information

A significance test for the lasso

A significance test for the lasso 1 First part: Joint work with Richard Lockhart (SFU), Jonathan Taylor (Stanford), and Ryan Tibshirani (Carnegie-Mellon Univ.) Second part: Joint work with Max Grazier G Sell, Stefan Wager and Alexandra

More information

CS145: Probability & Computing Lecture 11: Derived Distributions, Functions of Random Variables

CS145: Probability & Computing Lecture 11: Derived Distributions, Functions of Random Variables CS145: Probability & Computing Lecture 11: Derived Distributions, Functions of Random Variables Instructor: Erik Sudderth Brown University Computer Science March 5, 2015 Homework Submissions Electronic

More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information

Reduction of Model Complexity and the Treatment of Discrete Inputs in Computer Model Emulation

Reduction of Model Complexity and the Treatment of Discrete Inputs in Computer Model Emulation Reduction of Model Complexity and the Treatment of Discrete Inputs in Computer Model Emulation Curtis B. Storlie a a Los Alamos National Laboratory E-mail:storlie@lanl.gov Outline Reduction of Emulator

More information

A Hybrid ARIMA and Neural Network Model to Forecast Particulate. Matter Concentration in Changsha, China

A Hybrid ARIMA and Neural Network Model to Forecast Particulate. Matter Concentration in Changsha, China A Hybrid ARIMA and Neural Network Model to Forecast Particulate Matter Concentration in Changsha, China Guangxing He 1, Qihong Deng 2* 1 School of Energy Science and Engineering, Central South University,

More information

Bivariate distributions

Bivariate distributions Bivariate distributions 3 th October 017 lecture based on Hogg Tanis Zimmerman: Probability and Statistical Inference (9th ed.) Bivariate Distributions of the Discrete Type The Correlation Coefficient

More information

Estimating complex causal effects from incomplete observational data

Estimating complex causal effects from incomplete observational data Estimating complex causal effects from incomplete observational data arxiv:1403.1124v2 [stat.me] 2 Jul 2014 Abstract Juha Karvanen Department of Mathematics and Statistics, University of Jyväskylä, Jyväskylä,

More information