A Fully Nonparametric Modeling Approach to Binary Regression

1 A Fully Nonparametric Modeling Approach to Binary Regression
Maria
Department of Applied Mathematics and Statistics, University of California, Santa Cruz
SBIES, April 27-28, 2012

2 Outline
- Simulation Example
- Atmospheric Measurements
- Credit Card Data

6 Motivation
- Binary responses recorded along with covariates arise in many settings, including biometrics, econometrics, and the social sciences.
- Goal: determine the relationship between the response and the covariates.
- Examples: credit scoring, medicine, population dynamics, environmental sciences.
- The response-covariate relationship is described by the regression function.
- Standard approaches rely on linearity and distributional assumptions, e.g., GLMs.

8 Bayesian Nonparametrics
- Bayesian nonparametrics can be used to relax common distributional assumptions, resulting in flexible regression models with proper uncertainty quantification.
- Rather than modeling the regression function directly, model the joint distribution of the response and covariates using a nonparametric mixture model (West et al., 1994; Müller et al., 1996).
- This implies a form for the conditional response distribution, which is implicitly modeled nonparametrically.
- The approach treats the covariates as random.

11 Latent Variable Formulation
- Introduce latent continuous random variables $z$ that determine the binary responses $y$, so that $y = 1$ if and only if $z > 0$ (e.g., Albert and Chib, 1993).
- Estimate the joint distribution of latent responses and covariates, $f(z, x)$, using a nonparametric mixture model, to obtain flexible inference for the regression function $\Pr(y = 1 \mid x)$.
- The latent variables may be of interest in some applications, since they contain more information than a 0/1 observation.
- In biology applications, they may be thought of as maturity, latent survivorship, or a measure of health.

15 DP Mixture Model
- The Dirichlet process (DP) (Ferguson, 1973) generates random distributions, and can be used as a prior for spaces of distribution functions.
- DP constructive definition (Sethuraman, 1994): if $G \sim \mathrm{DP}(\alpha, G_0)$, then it is almost surely of the form
$$G = \sum_{l=1}^{\infty} p_l \delta_{\nu_l},$$
where $\nu_l \overset{iid}{\sim} G_0$, $l = 1, 2, \ldots$, and $z_r \overset{iid}{\sim} \mathrm{Beta}(1, \alpha)$, $r = 1, 2, \ldots$; define $p_1 = z_1$ and $p_l = z_l \prod_{r=1}^{l-1} (1 - z_r)$ for $l = 2, 3, \ldots$
- DP mixture model for the latent responses and covariates:
$$f(z, x; G) = \int N_{p+1}(z, x; \mu, \Sigma) \, dG(\mu, \Sigma), \qquad G \mid \alpha, \psi \sim \mathrm{DP}(\alpha, G_0(\mu, \Sigma; \psi))$$
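
The stick-breaking construction can be sketched in a few lines. This is a minimal illustration, assuming a truncation to a finite number of atoms and a standard normal base distribution $G_0$; the function name `stick_breaking_weights`, the value of `alpha`, and the truncation level are all hypothetical choices, not taken from the slides.

```python
import numpy as np

def stick_breaking_weights(alpha, num_atoms, rng):
    """Draw the first `num_atoms` stick-breaking weights p_1, ..., p_N."""
    z = rng.beta(1.0, alpha, size=num_atoms)  # z_r ~ Beta(1, alpha)
    # remaining[l] = prod_{r < l} (1 - z_r); the empty product is 1
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - z[:-1])))
    return z * remaining                      # p_l = z_l * prod_{r < l} (1 - z_r)

rng = np.random.default_rng(0)
p = stick_breaking_weights(alpha=2.0, num_atoms=50, rng=rng)
atoms = rng.normal(0.0, 1.0, size=50)  # nu_l ~ G_0, here an illustrative N(0, 1)
leftover = 1.0 - p.sum()               # mass not captured by the first 50 atoms
```

The weights sum to less than one for any finite truncation; `leftover` is the mass assigned to the atoms that were not generated.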

17 Implied Conditional Regression
- From the constructive definition, the model has an almost sure representation as a countable mixture of multivariate normals (MVNs):
$$f(z, x; G) = \sum_{l=1}^{\infty} p_l N_{p+1}(z, x; \mu_l, \Sigma_l)$$
- Binary regression functional: $\Pr(y = 1 \mid x; G)$.
- Marginalize over $z$ to obtain $f(x; G)$ and $f(y, x; G)$:
$$f(x; G) = \sum_{l=1}^{\infty} p_l N_p(x; \mu_l^x, \Sigma_l^{xx})$$
- The joint distribution of $y$ and $x$ is
$$f(y, x; G) = \sum_{l=1}^{\infty} p_l N_p(x; \mu_l^x, \Sigma_l^{xx}) \, \mathrm{Bern}\left(y; \Phi\left(\frac{\mu_l^z + \Sigma_l^{zx} (\Sigma_l^{xx})^{-1} (x - \mu_l^x)}{(\Sigma_l^{zz} - \Sigma_l^{zx} (\Sigma_l^{xx})^{-1} \Sigma_l^{xz})^{1/2}}\right)\right)$$

20 The Regression Function
- Implied regression function: $\Pr(y = 1 \mid x; G) = \sum_{l=1}^{\infty} w_l(x) \pi_l(x)$, with covariate-dependent weights and probabilities
$$w_l(x) \propto p_l N_p(x; \mu_l^x, \Sigma_l^{xx}), \qquad \pi_l(x) = \Phi\left(\frac{\mu_l^z + \Sigma_l^{zx} (\Sigma_l^{xx})^{-1} (x - \mu_l^x)}{(\Sigma_l^{zz} - \Sigma_l^{zx} (\Sigma_l^{xx})^{-1} \Sigma_l^{xz})^{1/2}}\right)$$
- The probabilities have the probit form with component-specific intercept and slope parameters.
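
The weighted-probit form of the regression function can be evaluated directly once the mixture is truncated to finitely many components. The sketch below assumes such a truncation and stores the latent response $z$ in position 0 of each mean vector and covariance matrix; the helper names (`normal_cdf`, `mvn_pdf`, `regression_function`) and the two-component parameter values are hypothetical.

```python
import math
import numpy as np

def normal_cdf(t):
    """Standard normal CDF, Phi(t)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def mvn_pdf(x, mean, cov):
    """Multivariate normal density N_k(x; mean, cov)."""
    d = x - mean
    k = len(mean)
    quad = d @ np.linalg.solve(cov, d)
    return math.exp(-0.5 * quad) / math.sqrt((2.0 * math.pi) ** k * np.linalg.det(cov))

def regression_function(x, weights, means, covs):
    """Pr(y = 1 | x) = sum_l w_l(x) pi_l(x) for a finite mixture."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    num, den = 0.0, 0.0
    for p_l, mu, S in zip(weights, means, covs):
        mu_z, mu_x = mu[0], mu[1:]
        S_zz, S_zx, S_xx = S[0, 0], S[0, 1:], S[1:, 1:]
        b = np.linalg.solve(S_xx, S_zx)            # (S_xx)^{-1} S_xz
        cond_mean = mu_z + b @ (x - mu_x)          # E(z | x) in component l
        cond_var = S_zz - S_zx @ b                 # Var(z | x) in component l
        pi_l = normal_cdf(cond_mean / math.sqrt(cond_var))
        dens = p_l * mvn_pdf(x, mu_x, S_xx)        # unnormalized weight w_l(x)
        num += dens * pi_l
        den += dens
    return num / den

# Hypothetical two-component mixture with a single covariate.
weights = [0.6, 0.4]
means = [np.array([-1.0, -1.0]), np.array([1.0, 2.0])]
covs = [np.array([[1.0, 0.5], [0.5, 1.0]]), np.array([[1.0, -0.3], [-0.3, 0.5]])]
prob = regression_function(0.5, weights, means, covs)
```

With a single component the weights cancel and the function reduces to one probit curve, which is a convenient sanity check.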

22 Identifiability
- Can the entire covariance matrix $\Sigma$ be estimated?
- Probit regression: $z \sim N(x^T \beta, 1)$; the binary responses cannot inform the scale of the latent responses.
- Retaining $\Sigma^{zx}$ is important: if it is set to 0, then $\pi_l(x)$ becomes just $\pi_l$.
- We have shown that if $\Sigma^{zz}$ is fixed, the remaining parameters are identifiable in the kernel of the mixture model for $y$ and $x$.
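
A short worked step shows why the scale of the latent response cannot be learned from binary data. It assumes a single normal latent response with conditional mean $m(x)$ and standard deviation $\sigma$; both symbols are introduced here only for illustration.

```latex
\begin{align*}
\Pr(y = 1 \mid x) &= \Pr(z > 0 \mid x), \qquad z \mid x \sim N(m(x), \sigma^2) \\
                  &= \Phi\left(\frac{m(x)}{\sigma}\right)
                   = \Phi\left(\frac{c \, m(x)}{c \, \sigma}\right)
                   \quad \text{for any } c > 0 .
\end{align*}
```

Rescaling the mean and standard deviation by the same constant leaves the probability unchanged, so one scale parameter has to be fixed.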

27 Facilitating Identifiability
- How can only one element of the covariance matrix be fixed? The usual inverse-Wishart distribution will not work.
- The square-root-free Cholesky decomposition of $\Sigma$ uses the relationship $\Delta = \beta \Sigma \beta^T$, with $\Delta$ diagonal with all elements $\delta_i > 0$, and $\beta$ lower triangular with 1 on its diagonal (Daniels and Pourahmadi, 2002; Webb and Forster, 2007).
- For $y = (y_1, \ldots, y_m) \sim N(\mu, \Sigma)$, with $\Delta = \beta \Sigma \beta^T$, the joint distribution for $y$ can be expressed in a recursive form:
$$y_1 \sim N(\mu_1, \delta_1), \qquad (y_k \mid y_1, \ldots, y_{k-1}) \sim N\Big(\mu_k - \sum_{j=1}^{k-1} \beta_{k,j} (y_j - \mu_j), \, \delta_k\Big), \quad k = 2, \ldots, m$$
- This is useful for modeling longitudinal data and specifying conditional independence assumptions.
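
One way to compute the decomposition numerically is from the ordinary Cholesky factor. This is a minimal sketch, not the authors' code; the function name `sqrt_free_cholesky` and the example matrix are hypothetical.

```python
import numpy as np

def sqrt_free_cholesky(sigma):
    """Return (beta, delta) with beta @ sigma @ beta.T = diag(delta).

    beta is lower triangular with ones on its diagonal; delta holds the
    positive diagonal elements delta_1, ..., delta_m.
    """
    chol = np.linalg.cholesky(sigma)   # sigma = chol @ chol.T, chol lower triangular
    d = np.diag(chol)
    unit_lower = chol / d              # divide column j by d[j]: unit diagonal
    beta = np.linalg.inv(unit_lower)   # inverse of a unit lower-triangular matrix
    delta = d ** 2
    return beta, delta

# Hypothetical 3 x 3 covariance matrix (symmetric, positive definite).
sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
beta, delta = sqrt_free_cholesky(sigma)
sigma_rebuilt = np.linalg.inv(beta) @ np.diag(delta) @ np.linalg.inv(beta).T
```

The first element of `delta` equals the first diagonal element of `sigma`, which is what allows a single variance to be fixed while the remaining parameters stay free.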

31 Facilitating Identifiability
- Here, no natural ordering is present, but the parameterization has other useful properties that we exploit.
- $\delta_1 = \Sigma^{zz}$: fix $\delta_1$, and mix on $\delta_2, \ldots, \delta_{p+1}$ and the $p(p + 1)/2$ free elements of $\beta$, denoted by the vector $\tilde{\beta}$.
- Then the DP mixture model becomes
$$f(z, x; G) = \int N_{p+1}(z, x; \mu, \beta^{-1} \Delta \beta^{-T}) \, dG(\mu, \tilde{\beta}, \Delta)$$
- Computationally convenient: there exist conjugate prior distributions for $\tilde{\beta}$ and $\delta_2, \ldots, \delta_{p+1}$, which are MVN and (independent) inverse-gamma.

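34 Hierarchical Model
- Blocked Gibbs sampler: truncate $G$ to $G_N(\cdot) = \sum_{l=1}^{N} p_l \delta_{W_l}(\cdot)$, with $W_l = (\mu_l, \tilde{\beta}_l, \Delta_l)$, and introduce configuration variables $(L_1, \ldots, L_n)$ taking values in $\{1, \ldots, N\}$.
$$\begin{aligned}
y_i \mid z_i &\overset{ind}{\sim} 1_{(y_i = 1)} 1_{(z_i > 0)} + 1_{(y_i = 0)} 1_{(z_i \le 0)}, & i &= 1, \ldots, n \\
(z_i, x_i) \mid W, L_i &\overset{ind}{\sim} N_{p+1}((z_i, x_i); \mu_{L_i}, \beta_{L_i}^{-1} \Delta_{L_i} \beta_{L_i}^{-T}), & i &= 1, \ldots, n \\
L_i \mid p &\overset{iid}{\sim} \sum_{l=1}^{N} p_l \delta_l(L_i), & i &= 1, \ldots, n \\
W_l \mid \psi &\overset{ind}{\sim} N_{p+1}(\mu_l; m, V) \, N_q(\tilde{\beta}_l; \theta, cI) \prod_{i=2}^{p+1} \mathrm{IG}(\delta_{i,l}; \nu_i, s_i), & l &= 1, \ldots, N
\end{aligned}$$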

35 Gibbs sampling may be used to simulate from the full posterior $p(W, L, p, \psi, \alpha, z \mid \text{data})$, with the conditionally conjugate base distribution and conjugate priors on $\psi$ and $\alpha$.
- The posterior for $G_N = (p, W)$ is imputed in the MCMC, enabling full inference for any functional of $f(z, x; G_N)$, now a finite sum.
- Binary regression functional: for any covariate value $x_0$, at iteration $r$ of the MCMC, calculate $\Pr(y = 1 \mid x_0; G_N^{(r)})$; this provides a point estimate and uncertainty quantification for the regression function.
- The same can be done for other functionals, such as the latent response distribution $f(z \mid x_0; G_N)$ at any covariate value $x_0$.
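
Once the functional has been evaluated at every MCMC iteration, point and interval estimates are simple summaries over iterations. The sketch below assumes the draws are stored as an array with one row per iteration and one column per covariate value; the function name `summarize_functional` is hypothetical, and the draws are synthetic placeholders rather than output from the model.

```python
import numpy as np

def summarize_functional(draws, level=0.95):
    """Posterior mean and equal-tailed interval at each covariate value.

    draws: array of shape (R, K) holding a functional such as
    Pr(y = 1 | x_0; G_N^(r)) for R MCMC iterations and K covariate values.
    """
    draws = np.asarray(draws, dtype=float)
    tail = (1.0 - level) / 2.0
    return {
        "mean": draws.mean(axis=0),
        "lower": np.quantile(draws, tail, axis=0),
        "upper": np.quantile(draws, 1.0 - tail, axis=0),
    }

# Synthetic placeholder draws on (0, 1); these are NOT real MCMC output.
rng = np.random.default_rng(1)
fake_draws = rng.beta(2.0, 5.0, size=(1000, 3))
summary = summarize_functional(fake_draws)
```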

39 Simulated Data
- Data $\{(z_i, x_i) : i = 1, \ldots, n\}$ were simulated from a mixture of 3 bivariate normals, and $y$ was determined from $z$.
- Compare inference from the binary regression model with data $(y, x)$ to that from a model which views $(z, x)$ as data.
- A practical prior specification approach, appropriate when little is known about the problem, is applied here.
- To specify priors on $\psi$, consider only one mixture component and use an approximate center and range of the data, as well as prior simulation to induce an approximate $\mathrm{Unif}(-1, 1)$ prior on $\mathrm{corr}(z, x)$.
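
A data set of this kind can be generated as follows. The slides do not report the mixture parameters or the sample size, so every numerical value below (weights, means, covariances, `n`, and the seed) is a hypothetical stand-in.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200                                         # hypothetical sample size

# Hypothetical mixture of 3 bivariate normals for (z, x).
weights = np.array([0.3, 0.4, 0.3])
means = np.array([[-1.0, -2.0],
                  [0.5, 0.0],
                  [1.5, 2.5]])                  # rows: (mean of z, mean of x)
covs = np.array([[[1.0, 0.5], [0.5, 1.0]],
                 [[1.0, -0.3], [-0.3, 0.5]],
                 [[0.8, 0.4], [0.4, 1.2]]])

labels = rng.choice(3, size=n, p=weights)       # component membership
zx = np.array([rng.multivariate_normal(means[k], covs[k]) for k in labels])
z, x = zx[:, 0], zx[:, 1]
y = (z > 0).astype(int)                         # y = 1 if and only if z > 0
```

The binary regression model would be fit to `(y, x)` only, while the comparison model treats `(z, x)` as observed.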

42 [Figure: two panels plotting Pr(z > 0 | x; G) (left) and Pr(y = 1 | x; G) (right) against x.] The inference for $\Pr(z > 0 \mid x; G)$ (left) is compared to that for $\Pr(y = 1 \mid x; G)$ (right) and the truth (solid line).

43 [Figure: two rows of three panels showing f(z | x = x1), f(z | x = x2), and f(z | x = x3) against z.] Top row: inference for $f(z \mid x_0; G)$ under the model which views $z$ as observed, with true densities as dashed lines, at 3 values of $x_0$. Bottom row: inference from the binary regression model.

45 Ozone and Wind Speed
- 111 daily measurements of wind speed (mph) and ozone concentration (parts per billion) in NYC over a 4-month period.
- Objective: model the probability of exceeding a certain ozone concentration as a function of wind speed.
- The model only sees whether or not there was an exceedance, but there is an actual ozone concentration underlying this 0/1 value.

46 [Figure: probability of ozone exceedance against wind speed (left); ozone concentration against wind speed (right).] Left: the probability that ozone concentration (parts per billion) exceeds a threshold of 70 decreases with wind speed (mph). Right: for comparison, the actual non-discretized ozone measurements as a function of wind speed.

47 [Figure: four panels of f(z | x0) against z.] Estimates for $f(z \mid x_0; G)$ at wind speed values of 5, 8, 10, and 15 mph.

49 Credit Cards and Income
- n = 100 subjects in a study were asked whether or not they owned a travel credit card, and their income was recorded (Agresti, 1996).
- In this situation, it is not clear that the latent continuous random variables have a meaningful interpretation, but the method can still be used for regression.
- Does the probability of owning a credit card change with income?

50 [Figure: Pr(y = 1 | x; G) against income in thousands.] The probability of owning a credit card appears to increase with income, with a slight dip or leveling off around an income of 40-50, since none of the subjects in that region owned a credit card.

51 Extensions to Ordinal Responses
- Similar methodology, wider range of applications.
- For an ordinal response with $C$ categories, assume $y = j$ if and only if $\gamma_{j-1} < z \le \gamma_j$, for $j = 1, \ldots, C$, and apply the same DP mixture of MVNs for $(z, x)$.
- For fixed cut-off points $\gamma$, it can be shown that all of $\mu$ and $\Sigma$ are identifiable in the induced kernel for the observables.
- The $C - 1$ free cut-off points can be fixed to arbitrary increasing values (Kottas et al., 2005), which is an advantage computationally.
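
The mapping from a latent value to an ordinal category can be sketched as below. It assumes the usual convention that the outermost cut-offs are $\gamma_0 = -\infty$ and $\gamma_C = \infty$, so only the $C - 1$ interior cut-offs are passed in; the function name `ordinal_from_latent` and the cut-off values are hypothetical.

```python
import numpy as np

def ordinal_from_latent(z, cutoffs):
    """Map latent z to categories 1, ..., C given C - 1 interior cut-offs.

    Returns y = j exactly when cutoffs[j - 2] < z <= cutoffs[j - 1], with
    the outer cut-offs taken to be -inf and +inf.
    """
    cutoffs = np.asarray(cutoffs, dtype=float)
    if np.any(np.diff(cutoffs) <= 0):
        raise ValueError("cut-off points must be strictly increasing")
    # side="left" counts the cut-offs strictly below z, so ties go to the lower category
    return np.searchsorted(cutoffs, z, side="left") + 1

cutoffs = [-1.0, 0.0, 1.0]                       # C = 4 categories
z = np.array([-2.0, -1.0, -0.5, 0.0, 0.3, 1.0, 5.0])
y = ordinal_from_latent(z, cutoffs)              # array([1, 1, 2, 2, 3, 3, 4])
```

With a single cut-off at 0 this reduces to the binary case, with categories labeled 1 and 2 instead of 0 and 1.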

55 Other Extensions
- Multivariate ordinal responses: $J$ ordinal responses associated with a vector of covariates for each subject, with $C_j$ categories associated with the $j$th response.
- Several applications, but limited existing methods for flexible inference.
- $y$ and $z$ are vectors, and $y_j = l$ if and only if $\gamma_{j,l-1} < z_j \le \gamma_{j,l}$, for $j = 1, \ldots, J$ and $l = 1, \ldots, C_j$.
- If $C_j > 2$ for all $j$, then no identifiability restrictions are needed.
- If $C_j = 2$ for some $j$, then the $(\beta, \Delta)$ parameterization can be used, and fixing certain elements of $\delta$ provides the necessary restrictions.
- Mixed ordinal-continuous responses.

60 Conclusions
- Binary responses measured along with covariates represent a simple setting, but the scope of problems in this category is large.
- This framework allows flexible, nonparametric inference for the regression relationship in a general binary regression problem.
- The methodology extends easily to larger classes of problems in ordinal regression, including multivariate responses and mixed responses, making the framework much more powerful, with utility in a wide variety of applications.
