Partial Generalized Additive Models

An Information-theoretic Approach for Selecting Variables and Avoiding Concurvity

Hong Gu, Department of Mathematics and Statistics, Dalhousie University
Mu Zhu, Department of Statistics and Actuarial Science, University of Waterloo

March 16, 2009

Outline
- Introduction
  - Concurvity and the interpretation of GAM
  - An illustrative example
- Sequential maximization of mutual information: GAM and pGAM
- Partial generalized additive models
- Simulations and examples
  - A simulation study
  - Ozone data
  - Air pollution and mortality data
- Summary and discussions

Generalized additive models (GAM)

Response variable: $Y$; predictor variables: $X = (X_1, \ldots, X_p)$.

GAM: $E(Y \mid X) = h(\eta(X)) = h(f_0 + f_1(X_1) + \cdots + f_p(X_p))$, where the response $Y$ follows an exponential-family distribution and $h$ is a known monotonic link function.

GAM is popular because of:
- its simple form and the intuitive interpretation of each individual predictor's effect on the response;
- its predictive accuracy.

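For concreteness, a minimal sketch of fitting such a model in R with mgcv (the package discussed below); the data and formula here are purely illustrative:

```r
# Minimal GAM fit with mgcv (illustrative data, not from the talk).
library(mgcv)

set.seed(1)
n  <- 200
x1 <- runif(n); x2 <- runif(n)
y  <- 5 * exp(x1) + sin(2 * pi * x2) + rnorm(n, sd = 0.3)

fit <- gam(y ~ s(x1) + s(x2), family = gaussian)  # E(Y|X) = h(f0 + f1(X1) + f2(X2))
summary(fit)
plot(fit, pages = 1)  # one panel per fitted smooth f_j
```
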
Concurvity and the interpretation of GAM

However, the interpretation is not straightforward: the contributions of the different variables are generally not independent.

Concurvity: strong functional relationships among the predictor variables (Hastie and Tibshirani, 1990; Donnell, Buja and Stuetzle, 1994); it is the analogue of collinearity.
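
As an aside not on the original slide: versions of mgcv more recent than this 2009 talk ship a concurvity() diagnostic that quantifies this; a minimal sketch, with x2 constructed to be a near-function of x1:

```r
# Concurvity diagnostic in mgcv (the function postdates this talk).
library(mgcv)

set.seed(2)
x1 <- runif(300)
x2 <- x1^2 + rnorm(300, sd = 0.03)    # x2 is almost a function of x1
y  <- sin(2 * pi * x1) + rnorm(300, sd = 0.2)

b <- gam(y ~ s(x1) + s(x2))
concurvity(b, full = TRUE)            # indices near 1 flag severe concurvity
```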

The seminal contributions of Simon Wood

Concurvity is commonly dealt with by controlling the complexity or smoothness of each fitted function, i.e., by shrinkage methods.
- Wood (2000): a general methodology for efficiently selecting multiple smoothing parameters.
- Wood (2004): solved a difficult numerical rank-deficiency problem, and showed that his methods provide much more stable functional reconstruction and very competitive MSE.
- Wood (2006): gam in the mgcv package, the current state of the art of GAM fitting.

But what about model interpretation when concurvity structures exist? And model simplification and variable selection in GAM?

An illustrative example

$X_1, X_2, X_3, X_4 \stackrel{iid}{\sim} U(0,1)$, $X_5 = 2X_1^3 + N(0, \sigma_1^2)$
$Y = (5e^{X_1} + 2X_1^3) + X_3 + N(0, \sigma_2^2)$

What is the effect of $X_1$ on $Y$?

[Figure: the function $f(x) = 5e^x + 2x^3$ on $[0, 1]$.]
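
A sketch of this data-generating process in R (the noise standard deviations did not survive transcription of this slide; 0.01 and 0.5 match the strong-concurvity, medium-SNR setting of the simulation study later):

```r
set.seed(3)
n  <- 300
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n); x4 <- runif(n)
x5 <- 2 * x1^3 + rnorm(n, sd = 0.01)              # nearly a function of x1
y  <- (5 * exp(x1) + 2 * x1^3) + x3 + rnorm(n, sd = 0.5)

library(mgcv)
full <- gam(y ~ s(x1) + s(x2) + s(x3) + s(x4) + s(x5))
plot(full, pages = 1)   # compare s(x1) with the true curve 5*exp(x) + 2*x^3
```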

Illustrative example: effects estimated by GAM

[Figure: estimated smooth effects s(x1, edf = 3.51), s(x2, edf = 1), s(x3, edf = 1.48), s(x4, edf = 1), and s(x5, edf = 1) from the full GAM.]

Illustrative example: effects estimated by pGAM

Only $X^{(1)}$ and $X^{(3)}$ are included in the final model by pGAM.

[Figure: estimated smooth effects s(x1, edf = 4) and s(x3, edf = 2).]

Mutual information (MI) and its properties

MI provides a good measure of the strength of statistical dependence between random variables. MI is defined as
$$MI_{XY} = E\left[\log \frac{f(X, Y)}{f_X(X)\, f_Y(Y)}\right].$$
The following properties make MI a nonlinear analogue of the linear correlation $\rho$ (Brillinger, 2004):
1. $I_{XY} = 0$ iff $X$ is independent of $Y$.
2. In the continuous case, $I_{XY} = \infty$ if $Y = g(X)$.
3. Invariance: $I_{XY} = I_{UV}$ if $u = u(x)$ and $v = v(y)$ are individually one-to-one measurable transformations.
4. For the bivariate normal, $I_{XY} = -\frac{1}{2}\log(1 - \rho_{XY}^2)$.

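A quick numerical check of property (4), not from the talk: estimate MI for simulated bivariate normal data with a crude two-dimensional histogram (plug-in) estimator and compare with $-\frac{1}{2}\log(1-\rho^2)$:

```r
# Histogram plug-in estimate of MI vs. the bivariate-normal closed form.
set.seed(4)
n   <- 1e5
rho <- 0.8
x <- rnorm(n)
y <- rho * x + sqrt(1 - rho^2) * rnorm(n)

nb  <- 50
pxy <- table(cut(x, nb), cut(y, nb)) / n          # joint cell probabilities
px  <- rowSums(pxy); py <- colSums(pxy)           # marginals
mi_hat <- sum(pxy * log(pxy / outer(px, py)), na.rm = TRUE)

c(estimate = mi_hat, theory = -0.5 * log(1 - rho^2))  # theory: about 0.511;
                                                      # the crude estimate is
                                                      # close but biased
```
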
MI and GAM

MI is the amount of information in $X$ that can be used to reduce the uncertainty about $Y$, or equivalently the amount of information in $Y$ that can be used to reduce the uncertainty about $X$:
$$MI(Y; X_1, \ldots, X_p) = H(Y) - H(Y \mid X) = H(X) - H(X \mid Y).$$
Function approximation amounts to finding the $\eta(X)$ that maximizes the MI between $Y$ and $\eta(X) = f(X_1, \ldots, X_p)$.

GAM uses a first-order (or low-order) ANOVA-like decomposition of $E(Y \mid X_1, \ldots, X_p) = f(x_1, \ldots, x_p)$ to cope with the curse of dimensionality.

Note: the maximum value of $MI(Y; \eta(X))$ is invariant to the choice of the link function.

MI and GAM

Generally, for any $\eta(X)$, $MI(Y; X_1, \ldots, X_p) \ge MI(Y; \eta(X))$; if $Y \perp X$ given $\eta(X)$, then $MI(Y; X_1, \ldots, X_p) = MI(Y; \eta(X))$.

The chain rule for MI:
$$MI(Y; X_1, \ldots, X_p) = MI(Y; X_1) + MI(Y; X_2 \mid X_1) + \cdots + MI(Y; X_p \mid X_{p-1}, \ldots, X_1).$$
Finding $f_1(X_1)$ to approach $MI(Y; X_1)$: $\max_\eta E(\ell(Y \mid \eta(X_1)))$, i.e., finding $E(Y \mid X_1) = f_1(X_1)$.

Write $Y = f_1(X_1) + Z_Y^1$, where $Z_Y^1 \perp X_1$; then $MI(Y; X_2 \mid X_1) = MI(Z_Y^1; X_2)$.

This leads to the familiar back-fitting algorithm.

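A bare-bones back-fitting sketch in R for the Gaussian case with two smooths (an illustration of the idea, using smooth.spline as the smoother):

```r
# Back-fitting for y = a + f1(x1) + f2(x2) + noise (Gaussian sketch).
backfit <- function(y, x1, x2, iters = 20) {
  a  <- mean(y)
  f1 <- f2 <- rep(0, length(y))
  for (it in 1:iters) {
    # smooth the partial residuals for each term in turn, then center
    f1 <- predict(smooth.spline(x1, y - a - f2), x1)$y
    f1 <- f1 - mean(f1)
    f2 <- predict(smooth.spline(x2, y - a - f1), x2)$y
    f2 <- f2 - mean(f2)
  }
  list(alpha = a, f1 = f1, f2 = f2)
}
```

Each pass updates one $f_j$ by smoothing the residuals of everything else, which is the sequential extraction of conditional MI terms described above.
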
MI and pGAM

An alternative way to fit the second term, $MI(Y; X_2 \mid X_1)$, leads us to a new procedure. Suppose $X_2 = g_{21}(X_1) + X^{(2)}$, where $X_1 \perp X^{(2)}$. Then
$$MI(Y; X_2 \mid X_1) = H(X_2 \mid X_1) - H(X_2 \mid Y, X_1) = MI(Y; X^{(2)}).$$
So: first, estimate $g_{21}$ by smoothing $X_2$ onto $X_1$; then fit a (univariate) GAM of $Y$ onto $X^{(2)} \equiv X_2 - g_{21}(X_1)$. Note that $X^{(2)} \perp X_1$.

This provides a natural way to avoid concurvity and constitutes the main idea of our procedure, pGAM.

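The corresponding computation in R, continuing the illustrative example above (a sketch; the smoothing of one covariate onto another is itself just a univariate GAM):

```r
# pGAM's key step (sketch): replace x5 by the part not explained by x1.
library(mgcv)
g51    <- gam(x5 ~ s(x1))          # estimate g_51 by smoothing x5 onto x1
x5_par <- x5 - fitted(g51)         # partial covariate X^(5)
cor(x1, x5_par)                    # near 0: the concurvity has been removed
m <- gam(y ~ s(x1) + s(x5_par))    # x5 can now enter only via its residual part
```
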
An explicit variable selection procedure

- At each step, enter the variable whose MI with $Y$ is largest.
- Stop when the MI between $Y$ and every remaining input variable becomes fairly small.
- Covariates deemed important a priori are always included in the initial model.
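
The stopping decision can be based on the deviance improvement; a sketch using mgcv's anova method:

```r
# Entry test (sketch): F-test for Gaussian responses; use test = "Chisq"
# for binomial or Poisson responses.
m0 <- gam(y ~ s(x1))               # current model
m1 <- gam(y ~ s(x1) + s(x3))       # current model plus the best candidate
anova(m0, m1, test = "F")          # insignificant improvement => stop at m0
```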

Indirect estimation of MI

Direct estimation of MI is not a trivial problem:
$$MI(X; Y) = H(X) + H(Y) - H(X, Y).$$
Instead, we work with a proxy of $MI(Y; X)$, namely $\max_\eta MI(Y; \eta(X))$. If $\eta(X)$ is sufficient for $Y$, then $MI(Y; \eta(X)) = MI(Y; X)$, and
$$MI(Y; \eta(X)) = E(\ell(Y \mid \eta(X))) - E \log f_Y(Y).$$
Thus only the conditional log-likelihood is needed to get the right ordering of the covariates at each step.

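Because $E\log f_Y(Y)$ is common to all candidates, ranking covariates by the MI proxy reduces to ranking univariate fits by log-likelihood, i.e., by deviance; a sketch:

```r
# Rank working covariates by the deviance of univariate GAM fits;
# the smallest deviance marks the largest (proxy) MI with y.
library(mgcv)
Xw   <- list(x1 = x1, x2 = x2, x3 = x3, x4 = x4, x5 = x5)
devs <- sapply(Xw, function(x) deviance(gam(y ~ s(x))))
sort(devs)
```
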
The pGAM algorithm: initialization

1. Start with a null model $m$ by fitting a GAM of $Y$ onto a constant; let $D_0$ be the deviance of $m$.
2. Center all $X_j$'s to have mean zero; let $X_w = \{X^{(j)} = X_j;\ j = 1, 2, \ldots, p\}$ be the set of working variables.
3. Set $t = 1$.

The pGAM algorithm: while $t \le p$

1. Fit a (univariate) GAM of $Y$ onto every working variable in $X_w$ and record the deviance of each resulting GAM.
2. Suppose $X^{(i)} \in X_w$ is the variable whose corresponding (univariate) GAM has the largest log-likelihood, or equivalently the smallest deviance. Add $X^{(i)}$ to $m$; record the resulting deviance, $D_{new}$; and let $X_w \leftarrow X_w \setminus \{X^{(i)}\}$.
3. Test whether $D_{new}$ is a significant improvement over $D_0$, e.g., with an F-test for Gaussian or a $\chi^2$-test for binomial or Poisson responses. If insignificant, remove $X^{(i)}$ from $m$ and output $m$.
4. For every $X^{(j)} \in X_w$ ($j \ne i$), fit the model $X^{(j)} = g_{ji}(X^{(i)}) + \epsilon_j$ by smoothing $X^{(j)}$ onto $X^{(i)}$; record the fitted function $g_{ji}$.
5. Let $t \leftarrow t + 1$; $D_0 \leftarrow D_{new}$; and $X^{(j)} \leftarrow X^{(j)} - g_{ji}(X^{(i)})$.

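Putting the steps together, a compact Gaussian-case sketch of this loop (the names and implementation details here are ours, not the authors' published code):

```r
# pGAM forward loop (sketch, Gaussian case): y is the response and
# X a data frame of covariates; returns the model and the g_ji fits.
library(mgcv)

pgam_sketch <- function(y, X, alpha = 0.05) {
  Xw <- as.data.frame(scale(X, scale = FALSE))   # center working variables
  m  <- gam(y ~ 1)                               # null model
  sel <- list(); gfits <- list()
  while (ncol(Xw) > 0) {
    # univariate GAM of y on each working variable; pick the best
    devs <- sapply(Xw, function(x) deviance(gam(y ~ s(x))))
    i <- names(which.min(devs))
    sel[[i]] <- Xw[[i]]
    dat  <- data.frame(y = y, sel)
    mnew <- gam(reformulate(sprintf("s(%s)", names(sel)), "y"), data = dat)
    tst  <- anova(m, mnew, test = "F")           # deviance improvement test
    p <- tst[["Pr(>F)"]][2]
    if (is.na(p) || p > alpha) { sel[[i]] <- NULL; break }
    m <- mnew
    xi <- Xw[[i]]; Xw[[i]] <- NULL
    for (j in names(Xw)) {                       # residualize the rest on X^(i)
      gj <- gam(Xw[[j]] ~ s(xi))
      gfits[[paste(j, i, sep = ".")]] <- gj
      Xw[[j]] <- Xw[[j]] - fitted(gj)
    }
  }
  list(model = m, partial_fits = gfits)
}
```
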
The pGAM algorithm: output

Output: the model $m$ and the fitted functions $g_{ji}$.

Simulation study: variable selection

$X_1, X_2, X_3, X_4 \stackrel{iid}{\sim} U(0,1)$, $X_5 = 2X_1^3 + N(0, \sigma_1^2)$
$Y = (5e^{X_1} + 2X_1^3) + X_3 + N(0, \sigma_2^2)$

[Table: number of times each variable combination, (1,3), (1,3,5), (1,3,*), (1), (5), (5,3), or other, was selected by pGAM out of 500 simulations, for each concurvity level (strong: σ1 = 0.01; medium: σ1 = 0.50; weak: σ1 = 0.90) and SNR (H: σ2 = 0.1; M: σ2 = 0.5; L: σ2 = 1.0). A star (*) means a variable other than X1, X3, or X5. The cell counts did not survive transcription.]

Simulation study: prediction on test sets

RMSE and RPSE. DIFF-RMSE = RMSE(GAM) minus RMSE(pGAM); positive differences indicate that pGAM is (slightly) better. The means did not survive transcription; the standard deviations did:

Concurvity          SNR            DIFF-RMSE mean (stdev)   DIFF-RPSE mean (stdev)
Strong (σ1 = 0.01)  H (σ2 = 0.1)   ... (0.0038)             ... (0.0010)
                    M (σ2 = 0.5)   ... (0.0220)             ... (0.0057)
                    L (σ2 = 1.0)   ... (0.0800)             ... (0.0183)
Medium (σ1 = 0.50)  H (σ2 = 0.1)   ... (0.0037)             ... (0.0009)
                    M (σ2 = 0.5)   ... (0.0200)             ... (0.0045)
                    L (σ2 = 1.0)   ... (0.0421)             ... (0.0091)
Weak (σ1 = 0.90)    H (σ2 = 0.1)   ... (0.0036)             ... (0.0009)
                    M (σ2 = 0.5)   ... (0.0188)             ... (0.0041)
                    L (σ2 = 1.0)   ... (0.0416)             ... (0.0088)

Simulation study: estimated functional effects by pGAM

Strong concurvity ($\sigma_1 = 0.01$) and medium SNR ($\sigma_2 = 0.5$). Pointwise mean and 90% CI, based on the runs in which pGAM chose the right variable combination (489 out of 500 simulations).

[Figure: pointwise mean and 90% CI for s(X1) and s(X3) estimated by pGAM.]

Simulation study: estimated functional effects by GAM

Strong concurvity ($\sigma_1 = 0.01$) and medium SNR ($\sigma_2 = 0.5$). Pointwise mean together with the 5th and 95th percentiles (500 simulations).

[Figure: pointwise mean and 90% CI for s(X1) through s(X5) estimated by GAM.]

Ozone data: variables selected by pGAM

Ozone: Gaussian response. The number in parentheses indicates the order in which pGAM selected the variable; unnumbered variables were not selected.

Table: Variables in the ozone data set.
  ozone          logarithm of ozone concentration (log-ppm)
  temp (1)       Sandburg Air Force Base temperature
  ibh (2)        inversion base height
  dpg (6)        Daggert pressure gradient
  vis (5)        visibility in miles
  vh             Vandenburg 500 millibar pressure height
  humidity (3)   humidity (%)
  ibt            inversion base temperature
  wind           wind speed (mph)
  doy (4)        day of the year

GAM, pGAM, and GAM with the same df as pGAM (1)

[Figure: estimated effects of temp, ibh, and humidity under the three fits; edf 3.79, 2.75, and 2.38 for GAM versus 4, 5, and 5 for pGAM and the matched-df GAM.]

GAM, pGAM, and GAM with the same df as pGAM (2)

[Figure: estimated effects of doy, vis, and dpg under the three fits; edf 4.55, 5.51, and 3.3 for GAM versus 4, 7, and 3 for pGAM and the matched-df GAM.]

Ozone data: differences in the covariate effects estimated by GAM and pGAM

- The effect of temp is much closer to a simple linear effect in pGAM.
- For GAM, humidity is not a significant covariate (p-value = 0.06 for the default GAM and p-value = 0.13 for the GAM using only six variables). For pGAM, after removing the partial effects of temp and ibh, the first two variables selected, humidity becomes a significant covariate (p-value = ...). Visually, the effect of humidity is much less flat in pGAM.
- The effects of doy and dpg estimated by pGAM peak at different locations than those estimated by GAM.

Comparison with the analysis of Donnell et al. (1994)

Concurvities found by Donnell et al. (1994):
1. Given ibh, there is a positive relationship between temp and ibt; given temp, there is a negative relationship between ibh and ibt.
2. The covariates temp and vh tend to increase together.
3. There is a strong and complex (nonlinear) relationship involving ozone, temp, dpg, and doy.

pGAM has successfully detected and removed the first two concurvities. The third concurvity involves ozone itself; this suggests that temp, dpg, and doy are all important covariates, and the pGAM model includes all of them.

Comparison with Breiman and Friedman (1985): alternating conditional expectations (ACE)

- ACE includes temp, ibh, dpg, vis, and doy; pGAM includes all of these plus humidity.
- Using doy as the single input, its effect peaks in late July and early August, consistent with the highest pollution days occurring from July to September.
- In the ACE model, the peak shifted to the beginning of May. "This was puzzling to [them], since the highest pollution days occur from July to September." The ACE paper's interpretation: doy may serve as "a partial surrogate for hours of daylight before and during the morning commuter rush."
- Under pGAM, the peak effect of doy occurs in late July; this suggests the shift in ACE was due to subtle concurvities.

Partial effects among covariates estimated by pGAM (1)

[Figure: fitted partial effects of temp on ibh, humidity, doy, vis, dpg, vh, wind, and ibt, and of ibh on humidity, doy, vis, dpg, vh, wind, and ibt.]

Partial effects among covariates estimated by pGAM (2)

[Figure: fitted partial effects of humidity on doy, vis, dpg, vh, wind, and ibt; of doy on vis, dpg, vh, wind, and ibt; of vis on dpg, vh, wind, and ibt; and of dpg on vh, wind, and ibt.]

Air pollution and mortality data: Philadelphia, 1995 to 2000

The National Mortality, Morbidity, and Air Pollution Study (NMMAPS): daily mortality, air pollution, and weather data (http//...).

Model: $\log(\lambda_t) = f(t) + \sum_{j=1}^{d} g_j(x_{jt}) + h(z_t)$, with mortality a Poisson response.

Table: Variables in the Philadelphia air pollution data set.
  y_t    mortality   number of non-accidental deaths in the 65 to 75 age group
  t      time        measured in days, i.e., 1, 2, ..., 2191
  z_t    pollutant   daily NO2 concentration
  x_1t   temp        average daily temperature
  x_2t   dptp        daily dewpoint temperature
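
A sketch of this model in mgcv (the column names and the data frame phila are assumptions; basis dimensions mirror the df shown on the next slide):

```r
# Poisson GAM for the daily mortality counts (sketch; data frame assumed).
library(mgcv)
fit <- gam(mortality ~ s(time, k = 21) + s(temp, k = 4) +
                       s(dptp, k = 4) + s(pollutant, k = 3),
           family = poisson, data = phila)  # k set to target df + 1, since
summary(fit)                                # one df is absorbed by centering
```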

Philadelphia air pollution and mortality data: approximate significance of smooth terms

Model: mortality ~ s(time, df = 20) + s(temp, df = 3) + s(dptp, df = 3) + s(pollutant, df = 2).

[Table: p-values of each term under GAM and under pGAM, for $g_1(x_{1t})$ = s(temp, df = 3), $g_2(x_{2t})$ = s(dptp, df = 3), $h(z_t)$ = s(pollutant, df = 2), and $f(t)$ = s(time, df = 20); the p-values did not survive transcription.]

Philadelphia air pollution and mortality data: effects of covariates estimated by GAM

[Figure: estimated smooths s(temp, 3), s(dptp, 3), s(pollutant, 2), and s(time, 20).]

Philadelphia air pollution and mortality data: effects of covariates estimated by pGAM

Only two covariates, $t$ (time) and $z_t$ (pollutant), are selected.

[Figure: estimated smooths s(time, 20) and s(pollutant, 2).]

Philadelphia air pollution and mortality data: partial effects estimated by pGAM

[Figure: fitted partial effects of time on temp, dptp, and pollutant, and of pollutant on temp and dptp.]

Philadelphia air pollution and mortality data: summary

Overall, our analysis suggests: (i) that mortality among Philadelphia residents between the ages of 65 and 75 was decreasing during the period 1995 to 2000; (ii) that mortality for this population was highest in winter and lowest in summer; and (iii) that, after adjusting for the strong seasonal effect, air pollution in the form of nitrogen dioxide still appeared to significantly increase mortality for this population.

Summary and discussions

- The back-fitting algorithm for GAM can be viewed conceptually as a sequential method for maximizing MI.
- pGAM gives better estimates of the covariates' functional effects when concurvity structures exist.
- A useful observation about estimating MI: first maximize the conditional log-likelihood of $Y$ given $X$, then estimate the entropy of $Y$ alone.
- For research on fitting GAMs with unknown link functions (e.g., Horowitz, 2001; Cadarso-Suarez et al., 2005), the invariance property of MI makes it clear that the link function can be chosen freely to facilitate model interpretation without affecting the goodness of fit. If the function $\eta$ is not fully flexible, however, the choice of link function does make a difference, as in GLM.


Regression Analysis. Regression: Methodology for studying the relationship among two or more variables Regression Analysis Regression: Methodology for studying the relationship among two or more variables Two major aims: Determine an appropriate model for the relationship between the variables Predict the

More information

The Poisson transform for unnormalised statistical models. Nicolas Chopin (ENSAE) joint work with Simon Barthelmé (CNRS, Gipsa-LAB)

The Poisson transform for unnormalised statistical models. Nicolas Chopin (ENSAE) joint work with Simon Barthelmé (CNRS, Gipsa-LAB) The Poisson transform for unnormalised statistical models Nicolas Chopin (ENSAE) joint work with Simon Barthelmé (CNRS, Gipsa-LAB) Part I Unnormalised statistical models Unnormalised statistical models

More information

Conditional distributions. Conditional expectation and conditional variance with respect to a variable.

Conditional distributions. Conditional expectation and conditional variance with respect to a variable. Conditional distributions Conditional expectation and conditional variance with respect to a variable Probability Theory and Stochastic Processes, summer semester 07/08 80408 Conditional distributions

More information

EE376A: Homeworks #4 Solutions Due on Thursday, February 22, 2018 Please submit on Gradescope. Start every question on a new page.

EE376A: Homeworks #4 Solutions Due on Thursday, February 22, 2018 Please submit on Gradescope. Start every question on a new page. EE376A: Homeworks #4 Solutions Due on Thursday, February 22, 28 Please submit on Gradescope. Start every question on a new page.. Maximum Differential Entropy (a) Show that among all distributions supported

More information

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form:

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form: Outline for today What is a generalized linear model Linear predictors and link functions Example: fit a constant (the proportion) Analysis of deviance table Example: fit dose-response data using logistic

More information

Spacetime models in R-INLA. Elias T. Krainski

Spacetime models in R-INLA. Elias T. Krainski Spacetime models in R-INLA Elias T. Krainski 2 Outline Separable space-time models Infant mortality in Paraná PM-10 concentration in Piemonte, Italy 3 Multivariate dynamic regression model y t : n observations

More information

STAT5044: Regression and Anova

STAT5044: Regression and Anova STAT5044: Regression and Anova Inyoung Kim 1 / 18 Outline 1 Logistic regression for Binary data 2 Poisson regression for Count data 2 / 18 GLM Let Y denote a binary response variable. Each observation

More information

MAS223 Statistical Inference and Modelling Exercises

MAS223 Statistical Inference and Modelling Exercises MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html 1 / 42 Passenger car mileage Consider the carmpg dataset taken from

More information

Introduction Outline Introduction Copula Specification Heuristic Example Simple Example The Copula perspective Mutual Information as Copula dependent

Introduction Outline Introduction Copula Specification Heuristic Example Simple Example The Copula perspective Mutual Information as Copula dependent Copula Based Independent Component Analysis SAMSI 2008 Abayomi, Kobi + + SAMSI 2008 April 2008 Introduction Outline Introduction Copula Specification Heuristic Example Simple Example The Copula perspective

More information

MIT Spring 2016

MIT Spring 2016 Generalized Linear Models MIT 18.655 Dr. Kempthorne Spring 2016 1 Outline Generalized Linear Models 1 Generalized Linear Models 2 Generalized Linear Model Data: (y i, x i ), i = 1,..., n where y i : response

More information

Lecture 1: August 28

Lecture 1: August 28 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 1: August 28 Our broad goal for the first few lectures is to try to understand the behaviour of sums of independent random

More information

Regularization Paths. Theme

Regularization Paths. Theme June 00 Trevor Hastie, Stanford Statistics June 00 Trevor Hastie, Stanford Statistics Theme Regularization Paths Trevor Hastie Stanford University drawing on collaborations with Brad Efron, Mee-Young Park,

More information

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression

More information

Semi-Parametric Importance Sampling for Rare-event probability Estimation

Semi-Parametric Importance Sampling for Rare-event probability Estimation Semi-Parametric Importance Sampling for Rare-event probability Estimation Z. I. Botev and P. L Ecuyer IMACS Seminar 2011 Borovets, Bulgaria Semi-Parametric Importance Sampling for Rare-event probability

More information

Rank-Based Methods. Lukas Meier

Rank-Based Methods. Lukas Meier Rank-Based Methods Lukas Meier 20.01.2014 Introduction Up to now we basically always used a parametric family, like the normal distribution N (µ, σ 2 ) for modeling random data. Based on observed data

More information

of the 7 stations. In case the number of daily ozone maxima in a month is less than 15, the corresponding monthly mean was not computed, being treated

of the 7 stations. In case the number of daily ozone maxima in a month is less than 15, the corresponding monthly mean was not computed, being treated Spatial Trends and Spatial Extremes in South Korean Ozone Seokhoon Yun University of Suwon, Department of Applied Statistics Suwon, Kyonggi-do 445-74 South Korea syun@mail.suwon.ac.kr Richard L. Smith

More information

Master s Written Examination

Master s Written Examination Master s Written Examination Option: Statistics and Probability Spring 05 Full points may be obtained for correct answers to eight questions Each numbered question (which may have several parts) is worth

More information

Independent Component Analysis

Independent Component Analysis 1 Independent Component Analysis Background paper: http://www-stat.stanford.edu/ hastie/papers/ica.pdf 2 ICA Problem X = AS where X is a random p-vector representing multivariate input measurements. S

More information

Chapter 11 Lecture Outline. Heating the Atmosphere

Chapter 11 Lecture Outline. Heating the Atmosphere Chapter 11 Lecture Outline Heating the Atmosphere They are still here! Focus on the Atmosphere Weather Occurs over a short period of time Constantly changing Climate Averaged over a long period of time

More information

Creating Non-Gaussian Processes from Gaussian Processes by the Log-Sum-Exp Approach. Radford M. Neal, 28 February 2005

Creating Non-Gaussian Processes from Gaussian Processes by the Log-Sum-Exp Approach. Radford M. Neal, 28 February 2005 Creating Non-Gaussian Processes from Gaussian Processes by the Log-Sum-Exp Approach Radford M. Neal, 28 February 2005 A Very Brief Review of Gaussian Processes A Gaussian process is a distribution over

More information

Random Variables. P(x) = P[X(e)] = P(e). (1)

Random Variables. P(x) = P[X(e)] = P(e). (1) Random Variables Random variable (discrete or continuous) is used to derive the output statistical properties of a system whose input is a random variable or random in nature. Definition Consider an experiment

More information

ST3241 Categorical Data Analysis I Generalized Linear Models. Introduction and Some Examples

ST3241 Categorical Data Analysis I Generalized Linear Models. Introduction and Some Examples ST3241 Categorical Data Analysis I Generalized Linear Models Introduction and Some Examples 1 Introduction We have discussed methods for analyzing associations in two-way and three-way tables. Now we will

More information

Chapter 4 Multiple Random Variables

Chapter 4 Multiple Random Variables Review for the previous lecture Theorems and Examples: How to obtain the pmf (pdf) of U = g ( X Y 1 ) and V = g ( X Y) Chapter 4 Multiple Random Variables Chapter 43 Bivariate Transformations Continuous

More information

A significance test for the lasso

A significance test for the lasso 1 First part: Joint work with Richard Lockhart (SFU), Jonathan Taylor (Stanford), and Ryan Tibshirani (Carnegie-Mellon Univ.) Second part: Joint work with Max Grazier G Sell, Stefan Wager and Alexandra

More information

CS145: Probability & Computing Lecture 11: Derived Distributions, Functions of Random Variables

CS145: Probability & Computing Lecture 11: Derived Distributions, Functions of Random Variables CS145: Probability & Computing Lecture 11: Derived Distributions, Functions of Random Variables Instructor: Erik Sudderth Brown University Computer Science March 5, 2015 Homework Submissions Electronic

More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information

Reduction of Model Complexity and the Treatment of Discrete Inputs in Computer Model Emulation

Reduction of Model Complexity and the Treatment of Discrete Inputs in Computer Model Emulation Reduction of Model Complexity and the Treatment of Discrete Inputs in Computer Model Emulation Curtis B. Storlie a a Los Alamos National Laboratory E-mail:storlie@lanl.gov Outline Reduction of Emulator

More information

A Hybrid ARIMA and Neural Network Model to Forecast Particulate. Matter Concentration in Changsha, China

A Hybrid ARIMA and Neural Network Model to Forecast Particulate. Matter Concentration in Changsha, China A Hybrid ARIMA and Neural Network Model to Forecast Particulate Matter Concentration in Changsha, China Guangxing He 1, Qihong Deng 2* 1 School of Energy Science and Engineering, Central South University,

More information

Bivariate distributions

Bivariate distributions Bivariate distributions 3 th October 017 lecture based on Hogg Tanis Zimmerman: Probability and Statistical Inference (9th ed.) Bivariate Distributions of the Discrete Type The Correlation Coefficient

More information

Estimating complex causal effects from incomplete observational data

Estimating complex causal effects from incomplete observational data Estimating complex causal effects from incomplete observational data arxiv:1403.1124v2 [stat.me] 2 Jul 2014 Abstract Juha Karvanen Department of Mathematics and Statistics, University of Jyväskylä, Jyväskylä,

More information