Beyond Mean Regression
Thomas Kneib, Lehrstuhl für Statistik, Georg-August-Universität Göttingen
8.3.2013, Innsbruck
Introduction
One of the top ten reasons to become a statistician (according to Friedman, Friedman & Amoo): statisticians are mean lovers.
Regression models in particular focus on means to reduce complexity.
Obviously, a mean is not sufficient to fully describe a distribution.
Usual regression models are based on data (y_i, z_i) for a continuous response variable y and covariates z:
y_i = η_i + ε_i,
where η_i is a regression predictor formed in terms of the covariates z_i.
Assumptions on the error term: E(ε_i) = 0, Var(ε_i) = σ², or ε_i ~ N(0, σ²).
The assumptions on the error term imply the following properties of the response distribution:
The predictor determines the expectation of the response: E(y_i | z_i) = η_i.
Homoscedasticity of the response: Var(y_i | z_i) = σ².
Parallel quantile curves of the response (if the errors are also normal): Q_τ(y_i | z_i) = η_i + z_τ σ.
Why could this be problematic?
The variance of the responses may depend on covariates (heteroscedasticity).
Other higher-order characteristics (skewness, kurtosis, ...) of the responses may depend on covariates.
Generic interest in extreme observations or the complete conditional distribution of the response.
Example: Munich rental guide (illustrative application in this talk).
Explain the net rent for a specific flat in terms of covariates such as living area or year of construction.
Published to give reference intervals of usual rents for both tenants and landlords.
We are not interested in average rents but rather in an interval covering typical rents.
[Figure: scatter plots of rent in Euro against living area and year of construction]
Some further examples:
Analysing childhood BMI patterns in (post-)industrialized countries, where interest is mainly on extreme forms of overweight (obesity).
Studying covariate effects on extreme forms of malnutrition in developing countries.
Efficiency estimation in agricultural production, where interest is on evaluating above-average performance of farms.
Modelling gas flow networks, where the behavior of the network in high- or low-demand situations shall be studied.
More flexible regression approaches considered in the following:
Regression models for location, scale and shape.
Quantile regression.
Expectile regression.
Regression models for location, scale and shape:
Retain the assumption of a specific error distribution but allow covariate effects not only on the mean.
Simplest example: regression for mean and variance of a normal distribution, where
y_i = η_i1 + exp(η_i2) ε_i,  ε_i ~ N(0, 1),
such that E(y_i | z_i) = η_i1 and Var(y_i | z_i) = exp(η_i2)².
In general: specify a distribution for the response, where (potentially) all parameters are related to predictors.
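The location-scale model above can be checked by simulation. A minimal sketch (all numeric values are illustrative, not from the talk): with predictors η_1 and η_2 for one covariate value, the simulated responses should have mean η_1 and variance exp(η_2)².

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative predictor values: eta1 drives the mean, eta2 the log std. dev.
eta1, eta2 = 3.0, 0.5
eps = rng.standard_normal(200_000)
y = eta1 + np.exp(eta2) * eps        # y_i = eta_i1 + exp(eta_i2) * eps_i

print(y.mean())   # close to E(y | z) = eta1 = 3.0
print(y.var())    # close to Var(y | z) = exp(eta2)^2 = exp(1) ≈ 2.72
```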
Quantile and expectile regression:
Drop the parametric assumption for the error / response distribution and instead estimate separate models for different asymmetries τ ∈ (0, 1):
y_i = η_iτ + ε_iτ.
Instead of assuming E(ε_iτ) = 0, we can for example assume P(ε_iτ ≤ 0) = τ, i.e. the τ-quantile of the error term is zero.
Yields a regression model for the quantiles of the response.
A dense set of quantiles completely characterizes the conditional distribution of the response.
Expectiles are a computationally attractive alternative to quantiles.
Estimated quantile curves for the Munich rental guide with linear effect of living area and quadratic effect of year of construction.
Homoscedastic linear model:
[Figure: estimated quantile curves for rent in Euro against living area and year of construction]
Heteroscedastic linear model:
[Figure: estimated quantile curves for rent in Euro against living area and year of construction]
Quantile regression:
[Figure: estimated quantile curves for rent in Euro against living area and year of construction]
Usually, modern regression data contain more complex structures such that linear predictors are not enough.
For example, in the Munich rental guide the effects of living area and year of construction may be of complex nonlinear form (instead of simply polynomial), and a spatial effect based on the subquarter information may be included to capture effects of missing covariates and spatial correlation.
Consider semiparametric extensions.
Overview for the Rest of the Talk
Semiparametric Predictor Specifications.
More on Models:
  Generalized Additive Models for Location, Scale and Shape.
  Quantile Regression.
  Expectile Regression.
Inferential Procedures & Comparison of the Approaches.
Semiparametric Regression
Semiparametric regression provides a generic framework for flexible regression models with predictor
η = β_0 + f_1(z) + ... + f_r(z)
where f_1, ..., f_r are generic functions of the covariate vector z.
Types of effects:
Linear effects: f(z) = x'β.
Nonlinear, smooth effects of continuous covariates: f(z) = f(x).
Varying coefficients: f(z) = u f(x).
Interaction surfaces: f(z) = f(x_1, x_2).
Spatial effects: f(z) = f_spat(s).
Random effects: f(z) = b_c with cluster index c.
Generic model description based on:
A design matrix Z_j, such that the vector of function evaluations f_j = (f_j(z_1), ..., f_j(z_n))' can be written as f_j = Z_j γ_j.
A quadratic penalty term pen(f_j) = pen(γ_j) = γ_j' K_j γ_j which operationalises smoothness properties of f_j.
From a Bayesian perspective, the penalty term corresponds to a multivariate Gaussian prior
p(γ_j) ∝ exp( -1/(2δ_j²) γ_j' K_j γ_j ).
Estimation then relies on a penalised fit criterion, e.g.
Σ_{i=1}^n (y_i - η_i)² + Σ_{j=1}^r λ_j γ_j' K_j γ_j
with smoothing parameters λ_j ≥ 0.
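For a single model term, the criterion above has the familiar closed-form minimiser γ̂ = (Z'Z + λK)⁻¹ Z'y. A minimal numeric sketch (design, penalty and data are all toy choices, here with a ridge-type K = I):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: n observations, one model term with design Z and penalty K.
n, k = 100, 5
Z = rng.standard_normal((n, k))
gamma_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = Z @ gamma_true + rng.standard_normal(n)

lam = 2.0
K = np.eye(k)

# Minimiser of ||y - Z gamma||^2 + lam * gamma' K gamma:
gamma_hat = np.linalg.solve(Z.T @ Z + lam * K, Z.T @ y)

def crit(g):
    return np.sum((y - Z @ g) ** 2) + lam * g @ K @ g

# gamma_hat attains a lower criterion value than any nearby perturbation:
print(crit(gamma_hat) <= crit(gamma_hat + 0.01))  # True
```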
Example 1. Penalised splines for nonlinear effects f(x):
Approximate f(x) in terms of a linear combination of B-spline basis functions: f(x) = Σ_k γ_k B_k(x).
Large variability in the estimates corresponds to large differences in adjacent coefficients, yielding the penalty term
pen(γ) = Σ_k (Δ^d γ_k)² = γ' D_d' D_d γ
with difference operator Δ^d and difference matrix D_d of order d.
The corresponding Bayesian prior is a random walk of order d, e.g.
γ_k = γ_{k-1} + u_k  (d = 1),  γ_k = 2γ_{k-1} - γ_{k-2} + u_k  (d = 2),
with u_k i.i.d. N(0, δ²).
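The difference penalty is easy to construct explicitly. A small sketch (dimensions are illustrative) building D_d, checking that γ' D_d' D_d γ equals the sum of squared d-th differences, and that a polynomial of degree below d is left unpenalised:

```python
import numpy as np

k, d = 8, 2                      # number of basis coefficients, penalty order
# d-th order difference matrix D_d: rows compute d-th differences of gamma.
D = np.diff(np.eye(k), n=d, axis=0)
K = D.T @ D                      # penalty matrix, pen(gamma) = gamma' K gamma

rng = np.random.default_rng(3)
gamma = rng.standard_normal(k)

# gamma' K gamma equals the sum of squared d-th differences ...
pen1 = gamma @ K @ gamma
pen2 = np.sum(np.diff(gamma, n=d) ** 2)
print(np.isclose(pen1, pen2))    # True

# ... and vanishes for coefficients on a straight line (degree < d),
# which the second-order penalty leaves unpenalised.
line = np.arange(k, dtype=float)
print(np.isclose(line @ K @ line, 0.0))  # True
```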
Example 2. Markov random fields for the estimation of spatial effects based on regional data:
Estimate a separate regression coefficient γ_s for each region, i.e. f = Zγ with
Z[i, s] = 1 if observation i belongs to region s, and 0 otherwise.
Penalty term based on differences of neighboring regions:
pen(γ) = Σ_s Σ_{r ∈ N(s)} (γ_s - γ_r)² = γ' K γ,
where N(s) is the set of neighbors of region s and K is the penalty (precision) matrix derived from the adjacency structure of the regions.
An equivalent Bayesian prior structure is obtained based on Gaussian Markov random fields.
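The matrix K can be built directly from the neighborhood structure. A toy sketch (the map with four regions is invented; neighboring pairs are counted once, which makes K the graph Laplacian: diagonal = number of neighbors, off-diagonal = -1 for neighbors):

```python
import numpy as np

# Toy map with 4 regions; edges list neighboring (unordered) pairs.
n_regions = 4
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]

# Precision matrix K of the Markov random field (a graph Laplacian).
K = np.zeros((n_regions, n_regions))
for s, r in edges:
    K[s, s] += 1
    K[r, r] += 1
    K[s, r] -= 1
    K[r, s] -= 1

rng = np.random.default_rng(4)
gamma = rng.standard_normal(n_regions)

# gamma' K gamma = sum over neighboring pairs of (gamma_s - gamma_r)^2
pen1 = gamma @ K @ gamma
pen2 = sum((gamma[s] - gamma[r]) ** 2 for s, r in edges)
print(np.isclose(pen1, pen2))  # True
```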
Inferential Procedures
For each of the three model classes discussed in the following, we will consider three potential avenues for inference:
Direct optimization of a fit criterion (e.g. maximum likelihood estimation for GAMLSS).
Bayesian approaches.
Functional gradient descent boosting.
Functional gradient descent boosting:
Define the estimation problem in terms of a loss function ρ (e.g. the negative log-likelihood).
Use the negative gradients of the loss function evaluated at the current fit as a measure for lack of fit.
Iteratively fit simple base-learning procedures to the negative gradients to update the model fit.
Componentwise updates of only the best-fitting model component yield automatic variable selection and model choice.
For semiparametric regression, penalized least squares estimates provide suitable base-learners.
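The steps above can be sketched for the simplest case: squared-error loss (where the negative gradient is just the residual vector) with componentwise linear least-squares base-learners. All data and tuning values are illustrative; note how only informative components get selected:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 5
X = rng.standard_normal((n, p))
# Only components 0 and 3 are informative; the rest should rarely be selected.
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.3 * rng.standard_normal(n)

beta = np.zeros(p)       # current fit: eta = X @ beta
nu = 0.1                 # step length

for _ in range(300):
    u = y - X @ beta     # negative gradient of squared loss = residuals
    # Fit each componentwise least-squares base-learner to u ...
    coefs = X.T @ u / np.sum(X ** 2, axis=0)
    rss = [np.sum((u - coefs[j] * X[:, j]) ** 2) for j in range(p)]
    # ... and update only the best-fitting component.
    j_best = int(np.argmin(rss))
    beta[j_best] += nu * coefs[j_best]

print(np.round(beta, 2))  # approximately [2, 0, 0, -1.5, 0]
```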
Generalized Additive Models for Location, Scale and Shape
GAMLSS provide a unified framework for semiparametric regression models in the case of complex response distributions depending on up to four parameters (µ_i, σ_i, ν_i, ξ_i), where usually
µ_i is the location parameter,
σ_i is the scale parameter, and
ν_i and ξ_i are shape parameters determining for example skewness or kurtosis.
Each parameter is related to a regression predictor via a suitable response function, i.e. µ_i = h_1(η_i,µ), σ_i = h_2(η_i,σ), ...
A very broad class of distributions is supported for both discrete and continuous responses. Most important examples for continuous responses:
Two-parameter normal distribution (location and scale).
Three-parameter power exponential distribution (location, scale and kurtosis).
Three-parameter t distribution (location, scale and degrees of freedom).
Three-parameter gamma distribution (location, scale and shape).
Four-parameter Box-Cox power distribution (location, scale, skewness and kurtosis).
Direct optimization:
For GAMLSS, the likelihood is available due to the explicit assumption made for the distribution of the response.
Maximization can be achieved by penalized iteratively weighted least squares (IWLS) estimation.
Estimation and choice of the smoothing parameters is challenging, at least for complex models.
Bayesian inference:
Inference based on Markov chain Monte Carlo (MCMC) simulations is in principle straightforward but requires careful choice of the proposal densities.
Promising results have been obtained based on IWLS proposals.
Smoothing parameter choice is immediately included.
Boosting:
Due to the multiple predictors, the usual boosting framework has to be adapted but basically still works.
Results for the Munich rental guide obtained with an additive model for location and scale:
[Figure: estimated effects on the mean for area in sqm and year of construction]
[Figure: estimated effects on the standard deviation for area in sqm and year of construction]
Quantile Regression
The theoretical τ-quantile q_τ for a continuous random variable is characterized by
P(Y ≤ q_τ) ≥ τ and P(Y ≥ q_τ) ≥ 1 - τ.
Estimation of quantiles based on i.i.d. samples y_1, ..., y_n can be accomplished by
q̂_τ = argmin_q Σ_{i=1}^n w_τ(y_i, q) |y_i - q|
with asymmetric weights
w_τ(y_i, q) = 1 - τ if y_i < q, 0 if y_i = q, τ if y_i > q.
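The argmin characterisation can be verified numerically: minimising the weighted absolute loss over a grid recovers the empirical τ-quantile. A small sketch with simulated data (sample size and grid resolution are arbitrary):

```python
import numpy as np

def check_loss(y, q, tau):
    # w_tau(y_i, q) |y_i - q| with weight tau above q and 1 - tau below
    w = np.where(y > q, tau, 1.0 - tau)
    return np.sum(w * np.abs(y - q))

rng = np.random.default_rng(6)
y = rng.standard_normal(1001)
tau = 0.25

# Minimise the loss over a grid of candidates; the minimiser is
# (up to grid resolution) the empirical tau-quantile.
grid = np.linspace(y.min(), y.max(), 5001)
q_hat = grid[int(np.argmin([check_loss(y, q, tau) for q in grid]))]

print(q_hat, np.quantile(y, tau))  # nearly identical
```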
[Figure: plot of the weighted losses w_τ(y, q) |y - q| (for q = 0)]
Quantile regression starts with the regression model
y_i = η_iτ + ε_iτ.
Instead of assuming E(ε_iτ) = 0 as in mean regression, we assume
F_{ε_iτ}(0) = P(ε_iτ ≤ 0) = τ,
i.e. the τ-quantile of the error is zero.
This implies that the predictor coincides with the τ-quantile of the conditional distribution of the response, i.e.
F_{y_i}(η_iτ) = P(y_i ≤ η_iτ) = τ.
Quantile regression therefore
is distribution-free since it does not make any specific assumptions on the type of errors,
does not even require i.i.d. errors, and
allows for heteroscedasticity.
Note that each parametric regression model also induces a quantile regression model.
Example: the heteroscedastic normal model y ~ N(η_1, exp(η_2)²) yields
q_τ = η_1 + exp(η_2) z_τ.
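This induced quantile can be checked by simulation. A minimal sketch with invented predictor values (z_τ is hard-coded to the standard normal 0.9-quantile to keep the example dependency-free):

```python
import numpy as np

rng = np.random.default_rng(7)

eta1, eta2, tau = 1.0, 0.3, 0.9
z_tau = 1.2816                       # Phi^{-1}(0.9), standard normal quantile

y = rng.normal(loc=eta1, scale=np.exp(eta2), size=500_000)
q_model = eta1 + np.exp(eta2) * z_tau   # quantile implied by the normal model

print(np.quantile(y, tau))  # close to q_model
print(q_model)
```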
Direct optimisation:
Classical estimation is achieved by minimizing
Σ_{i=1}^n w_τ(y_i, η_iτ) |y_i - η_iτ| + Σ_{j=1}^p λ_j pen(f_j).
Can be solved with linear programming as long as the penalties are also linear functionals, e.g. for total variation penalization pen(f_j) = ∫ |f_j''(x)| dx.
Does not fit well with the class of quadratic penalties we are considering.
Smoothing parameter selection is still challenging, in particular with multiple smoothing parameters.
Bayesian inference:
Although quantile regression is distribution-free, there is an auxiliary error distribution that links ML estimation to quantile regression.
Assume an asymmetric Laplace distribution for the responses, i.e. y_i ~ ALD(η_iτ, σ², τ) with density proportional to
exp( -w_τ(y_i, η_iτ) |y_i - η_iτ| / σ² ).
Maximizing the resulting likelihood
exp( -Σ_{i=1}^n w_τ(y_i, η_iτ) |y_i - η_iτ| / σ² )
is equivalent to minimizing the quantile loss criterion.
A computationally attractive way of working with the ALD in a Bayesian framework is its scale-mixture representation:
If z_i | σ² ~ Exp(1/σ²) and y_i | z_i, η_iτ, σ² ~ N(η_iτ + ξ z_i, σ²/w_i) with
ξ = (1 - 2τ) / (τ(1 - τ)),  w_i = 1/(δ² z_i),  δ² = 2 / (τ(1 - τ)),
then y_i is marginally ALD(η_iτ, σ², τ) distributed.
Allows to construct efficient Gibbs samplers or variational Bayes approximations to explore the posterior after imputing the z_i as additional unknowns.
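The mixture representation can be verified by simulation: drawing z from the exponential and then y from the conditional normal should give a marginal ALD whose τ-quantile equals the location η. A sketch with σ² fixed at 1 and invented values for τ and η:

```python
import numpy as np

rng = np.random.default_rng(8)

tau, eta = 0.3, 2.0
sigma2 = 1.0                                   # scale fixed at 1 for simplicity
xi = (1 - 2 * tau) / (tau * (1 - tau))
delta2 = 2 / (tau * (1 - tau))

n = 400_000
z = rng.exponential(scale=sigma2, size=n)      # z_i | sigma^2 ~ Exp(1/sigma^2)
# y_i | z_i ~ N(eta + xi * z_i, sigma^2 / w_i) with w_i = 1/(delta^2 * z_i)
y = rng.normal(loc=eta + xi * z, scale=np.sqrt(sigma2 * delta2 * z))

# Marginally y ~ ALD(eta, sigma^2, tau), whose tau-quantile is eta:
print(np.quantile(y, tau))  # close to eta = 2.0
```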
Boosting:
Boosting can be immediately applied in the quantile regression context since it is formulated in terms of a loss function.
Negative gradients are defined almost everywhere, i.e. no conceptual problems.
Results for a geoadditive Bayesian quantile regression model:
[Figure: maps of estimated spatial effects for τ = 0.1, 0.2, 0.5 and 0.9]
[Figure: estimated effects of living area and year of construction for several quantiles]
Expectile Regression
What is expectile regression?
Σ_{i=1}^n |y_i - η_i| → min: median regression.
Σ_{i=1}^n w_τ(y_i, η_iτ) |y_i - η_iτ| → min: quantile regression.
Σ_{i=1}^n |y_i - η_i|² → min: mean regression.
?? → min: expectile regression.
The missing loss is the asymmetrically weighted squared loss:
Σ_{i=1}^n w_τ(y_i, η_iτ) |y_i - η_iτ|² → min: expectile regression.
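The scheme can be illustrated numerically: minimising the asymmetrically weighted squared loss over a grid yields the sample expectile, and τ = 0.5 recovers the sample mean, i.e. ordinary mean regression. A small sketch with simulated data:

```python
import numpy as np

def expectile_loss(y, e, tau):
    # w_tau(y_i, e) |y_i - e|^2 with weight tau above e and 1 - tau below
    w = np.where(y > e, tau, 1.0 - tau)
    return np.sum(w * (y - e) ** 2)

rng = np.random.default_rng(9)
y = rng.standard_normal(10_000)

grid = np.linspace(-1.0, 1.0, 2001)

# tau = 0.5 recovers the sample mean (ordinary mean regression) ...
e_half = grid[int(np.argmin([expectile_loss(y, e, 0.5) for e in grid]))]
print(abs(e_half - y.mean()) < 1e-3)   # True

# ... while tau = 0.9 gives an expectile in the upper part of the distribution.
e_09 = grid[int(np.argmin([expectile_loss(y, e, 0.9) for e in grid]))]
print(e_09 > e_half)                   # True
```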
Theoretical expectiles are obtained by solving
τ = ∫_{-∞}^{e_τ} |y - e_τ| f_y(y) dy / ∫_{-∞}^{∞} |y - e_τ| f_y(y) dy = (G_y(e_τ) - e_τ F_y(e_τ)) / (2(G_y(e_τ) - e_τ F_y(e_τ)) + (e_τ - µ)),
where f_y(·) and F_y(·) denote the density and cumulative distribution function of y, G_y(e) = ∫_{-∞}^e y f_y(y) dy is the partial moment function of y, and G_y(∞) = µ is the expectation of y.
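The identity is easy to evaluate for the standard normal, where F = Φ, G(e) = -φ(e) and µ = 0. A sketch checking that the mean is the 0.5-expectile and that τ increases with e:

```python
import math

def phi(x):
    # standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def Phi(x):
    # standard normal cdf via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def tau_of(e):
    # tau(e) = (G(e) - e F(e)) / (2 (G(e) - e F(e)) + (e - mu)) with
    # G(e) = -phi(e) and mu = 0 for the standard normal
    gef = -phi(e) - e * Phi(e)
    return gef / (2 * gef + e)

print(round(tau_of(0.0), 6))                     # 0.5: the mean is e_0.5
print(tau_of(1.0) > tau_of(0.0) > tau_of(-1.0))  # True: tau increases in e
```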
Direct optimization:
Since the expectile loss is differentiable, estimates for the basis coefficients can be obtained by iterating
γ̂_jτ^[t+1] = (Z_j' W_τ^[t] Z_j + λ_j K_j)^(-1) Z_j' W_τ^[t] y.
A combination with mixed model methodology allows to estimate the smoothing parameters.
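The iteration above can be sketched directly. A toy penalised IWLS for a single linear term (design, penalty and data are illustrative; with homoscedastic errors and τ = 0.8 the fitted line is shifted upwards relative to the mean-regression line):

```python
import numpy as np

rng = np.random.default_rng(10)
tau = 0.8

# Toy design: intercept + slope, with a mild ridge penalty.
n = 500
x = rng.uniform(0, 1, n)
Z = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.standard_normal(n)
lam, K = 0.01, np.eye(2)

gamma = np.zeros(2)
for _ in range(100):
    # asymmetric weights W_tau: tau above the current fit, 1 - tau below
    w = np.where(y > Z @ gamma, tau, 1.0 - tau)
    WZ = Z * w[:, None]
    # gamma^[t+1] = (Z' W Z + lam K)^(-1) Z' W y
    gamma_new = np.linalg.solve(Z.T @ WZ + lam * K, WZ.T @ y)
    if np.allclose(gamma_new, gamma):
        break
    gamma = gamma_new

print(gamma)  # intercept above 1.0 (upper expectile), slope near 2.0
```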
Bayesian inference:
Similarly as for quantile regression, an asymmetric normal distribution can be defined as auxiliary distribution for the responses.
No scale-mixture representation is known so far.
The Bayesian formulation is probably less important since inference is directly tractable.
Boosting:
Boosting can be immediately applied in the expectile regression context.
Comparison
Advantages of GAMLSS:
One joint model for the distribution of the responses.
Interpretability of the estimated effects in terms of parameters of the response distribution.
Quantiles (or expectiles) derived from GAMLSS will always be coherent, i.e. their ordering will be preserved.
Readily available in both frequentist and Bayesian formulation.
Disadvantages of GAMLSS:
Potential for misspecification of the observation model.
Model checking is difficult in complex settings.
If quantiles are of ultimate interest, GAMLSS do not provide direct estimates for these.
Advantages of quantile regression:
Completely distribution-free approach.
Easy interpretation in terms of conditional quantiles.
The Bayesian formulation enables very flexible, fully data-driven semiparametric specifications of the predictor.
Disadvantages of quantile regression:
The Bayesian formulation requires an auxiliary error distribution (that will usually be a misspecification).
The estimated cumulative distribution function is a step function even for continuous data.
Additional efforts are required to avoid crossing of quantile curves.
Advantages of expectile regression:
Computationally simple (iteratively weighted least squares).
Still allows to characterize the complete conditional distribution of the response.
Quantiles (or conditional distributions) can be computed based on expectiles.
Expectiles seem to be more efficient than quantiles in close-to-Gaussian situations.
Expectile crossing seems to be less of an issue than quantile crossing.
The estimated expectile curve is smooth.
Disadvantages of expectile regression:
Immediate interpretation of expectiles is difficult.
Summary
There is more than mean regression!
Semiparametric extensions become available also for models beyond mean regression.
You can do this at home:
Quantile regression: R package quantreg.
Bayesian quantile regression: BayesX (MCMC) and a forthcoming R package on variational Bayes approximations (VA).
GAMLSS: R packages gamlss and gamboostLSS.
Expectile regression: R package expectreg.
Interesting addition to the models considered: modal regression (yet to be explored).
Acknowledgements:
This talk is mostly based on joint work with Nora Fenske, Benjamin Hofner, Torsten Hothorn, Göran Kauermann, Stefan Lang, Andreas Mayr, Matthias Schmid, Linda Schulze Waltrup, Fabian Sobotka, Elisabeth Waldmann and Yu Yue.
Financial support has been provided by the German Research Foundation (DFG).
A place called home: http://www.statoek.wiso.uni-goettingen.de