Beyond Mean Regression


Thomas Kneib, Lehrstuhl für Statistik, Georg-August-Universität Göttingen
8.3.2013, Innsbruck

Introduction

One of the top ten reasons to become a statistician (according to Friedman, Friedman & Amoo): statisticians are mean lovers. Regression models in particular focus on means to reduce complexity. Obviously, a mean is not sufficient to fully describe a distribution.

Usual regression models are based on data $(y_i, z_i)$ for a continuous response variable $y$ and covariates $z$:

$$y_i = \eta_i + \varepsilon_i,$$

where $\eta_i$ is a regression predictor formed in terms of the covariates $z_i$. Assumptions on the error term: $E(\varepsilon_i) = 0$, $\operatorname{Var}(\varepsilon_i) = \sigma^2$, or $\varepsilon_i \sim N(0, \sigma^2)$.

The assumptions on the error term imply the following properties of the response distribution:

- The predictor determines the expectation of the response: $E(y_i \mid z_i) = \eta_i$.
- Homoscedasticity of the response: $\operatorname{Var}(y_i \mid z_i) = \sigma^2$.
- Parallel quantile curves of the response (if the errors are also normal): $Q_\tau(y_i \mid z_i) = \eta_i + z_\tau \sigma$.

Why could this be problematic?

- The variance of the responses may depend on covariates (heteroscedasticity).
- Other higher-order characteristics (skewness, kurtosis, ...) of the responses may depend on covariates.
- There may be generic interest in extreme observations or the complete conditional distribution of the response.

Example: Munich rental guide (illustrative application in this talk). Explain the net rent for a specific flat in terms of covariates such as living area or year of construction. The guide is published to give reference intervals of usual rents for both tenants and landlords. We are not interested in average rents but rather in an interval covering typical rents.

[Figure: scatter plots of rent in Euro against living area and against year of construction]

Some further examples:

- Analysing childhood BMI patterns in (post-)industrialized countries, where interest is mainly in extreme forms of overweight (obesity).
- Studying covariate effects on extreme forms of malnutrition in developing countries.
- Efficiency estimation in agricultural production, where interest is in evaluating above-average performance of farms.
- Modelling gas flow networks, where the behaviour of the network in high- or low-demand situations is to be studied.

More flexible regression approaches considered in the following:

- Regression models for location, scale and shape.
- Quantile regression.
- Expectile regression.

Regression models for location, scale and shape: retain the assumption of a specific error distribution but allow covariate effects not only on the mean. The simplest example is regression for the mean and variance of a normal distribution, where

$$y_i = \eta_{i1} + \exp(\eta_{i2})\,\varepsilon_i, \qquad \varepsilon_i \sim N(0, 1),$$

such that $E(y_i \mid z_i) = \eta_{i1}$ and $\operatorname{Var}(y_i \mid z_i) = \exp(\eta_{i2})^2$. In general: specify a distribution for the response, where (potentially) all parameters are related to predictors.
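As a minimal sketch (not from the talk), such a normal location-scale model can be fitted with the R package gamlss mentioned in the summary; the data here are simulated and all variable names are illustrative:

```r
## Minimal sketch: normal location-scale regression with gamlss.
## Simulated data; variable names are illustrative only.
library(gamlss)

set.seed(1)
n <- 500
x <- runif(n)
y <- 1 + 2 * x + exp(-1 + 1.5 * x) * rnorm(n)  # mean and sd both depend on x
d <- data.frame(y = y, x = x)

## For the NO family, mu has an identity link and sigma a log link,
## matching the exp(eta_2) parameterization above.
fit <- gamlss(y ~ x, sigma.formula = ~ x, family = NO, data = d)
summary(fit)
```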

Quantile and expectile regression: drop the parametric assumption for the error / response distribution and instead estimate separate models for different asymmetries $\tau \in [0, 1]$:

$$y_i = \eta_{i\tau} + \varepsilon_{i\tau}.$$

Instead of assuming $E(\varepsilon_{i\tau}) = 0$, we can for example assume $P(\varepsilon_{i\tau} \le 0) = \tau$, i.e. the $\tau$-quantile of the error term is zero. This yields a regression model for the quantiles of the response. A dense set of quantiles completely characterizes the conditional distribution of the response. Expectiles are a computationally attractive alternative to quantiles.

Estimated quantile curves for the Munich rental guide with a linear effect of living area and a quadratic effect of year of construction. Homoscedastic linear model:

[Figure: estimated quantile curves, rent in Euro against living area and against year of construction]

Heteroscedastic linear model:

[Figure: estimated quantile curves under the heteroscedastic linear model]

Quantile regression:

[Figure: estimated quantile curves under quantile regression]

Usually, modern regression data contain more complex structures, such that linear predictors are not enough. For example, in the Munich rental guide the effects of living area and size of the flat may be of complex nonlinear form (instead of simply polynomial), and a spatial effect based on the subquarter information may be included to capture effects of missing covariates and spatial correlation. We therefore consider semiparametric extensions.

Overview for the Rest of the Talk

- Semiparametric predictor specifications.
- More on models: generalized additive models for location, scale and shape; quantile regression; expectile regression.
- Inferential procedures & comparison of the approaches.

Semiparametric Regression

Semiparametric regression provides a generic framework for flexible regression models with predictor

$$\eta = \beta_0 + f_1(z) + \ldots + f_r(z)$$

where $f_1, \ldots, f_r$ are generic functions of the covariate vector $z$. Types of effects:

- Linear effects: $f(z) = x'\beta$.
- Nonlinear, smooth effects of continuous covariates: $f(z) = f(x)$.
- Varying coefficients: $f(z) = u \, f(x)$.
- Interaction surfaces: $f(z) = f(x_1, x_2)$.
- Spatial effects: $f(z) = f_{\text{spat}}(s)$.
- Random effects: $f(z) = b_c$ with cluster index $c$.

Generic model description based on

- a design matrix $Z_j$, such that the vector of function evaluations $f_j = (f_j(z_1), \ldots, f_j(z_n))'$ can be written as $f_j = Z_j \gamma_j$;
- a quadratic penalty term $\operatorname{pen}(f_j) = \operatorname{pen}(\gamma_j) = \gamma_j' K_j \gamma_j$, which operationalises smoothness properties of $f_j$.

From a Bayesian perspective, the penalty term corresponds to a multivariate Gaussian prior

$$p(\gamma_j) \propto \exp\left(-\frac{1}{2\delta_j^2}\, \gamma_j' K_j \gamma_j\right).$$

Estimation then relies on a penalised fit criterion, e.g.

$$\sum_{i=1}^n (y_i - \eta_i)^2 + \sum_{j=1}^r \lambda_j\, \gamma_j' K_j \gamma_j$$

with smoothing parameters $\lambda_j \ge 0$.

Example 1. Penalised splines for nonlinear effects $f(x)$: approximate $f(x)$ in terms of a linear combination of B-spline basis functions

$$f(x) = \sum_k \gamma_k B_k(x).$$

Large variability in the estimates corresponds to large differences in adjacent coefficients, yielding the penalty term

$$\operatorname{pen}(\gamma) = \sum_k (\Delta^d \gamma_k)^2 = \gamma' D_d' D_d \gamma$$

with difference operator $\Delta^d$ and difference matrix $D_d$ of order $d$. The corresponding Bayesian prior is a random walk of order $d$, e.g.

$$\gamma_k = \gamma_{k-1} + u_k, \qquad \gamma_k = 2\gamma_{k-1} - \gamma_{k-2} + u_k$$

with $u_k \overset{\text{i.i.d.}}{\sim} N(0, \delta^2)$.
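To make the penalised fit criterion concrete, here is an illustrative sketch (simulated data, fixed smoothing parameter) of penalised least squares with a B-spline basis and a second-order difference penalty:

```r
## Minimal sketch: penalised least squares for a B-spline basis
## with a second-order difference penalty. Simulated data.
library(splines)

set.seed(2)
n <- 200
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

Z <- bs(x, df = 20)                        # B-spline design matrix, 20 basis functions
D <- diff(diag(ncol(Z)), differences = 2)  # difference matrix D_2
lambda <- 10                               # smoothing parameter, fixed here

## gamma = (Z'Z + lambda D'D)^{-1} Z'y
gamma <- solve(crossprod(Z) + lambda * crossprod(D), crossprod(Z, y))
plot(x, y); lines(x, Z %*% gamma, lwd = 2)
```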

Example 2. Markov random fields for the estimation of spatial effects based on regional data: estimate a separate regression coefficient $\gamma_s$ for each region, i.e. $f = Z\gamma$ with

$$Z[i, s] = \begin{cases} 1 & \text{observation } i \text{ belongs to region } s \\ 0 & \text{otherwise.} \end{cases}$$

Penalty term based on differences of neighbouring regions:

$$\operatorname{pen}(\gamma) = \sum_s \sum_{r \in N(s)} (\gamma_s - \gamma_r)^2 = \gamma' K \gamma$$

where $N(s)$ is the set of neighbours of region $s$ and $K$ is the corresponding adjacency-based neighbourhood matrix. An equivalent Bayesian prior structure is obtained based on Gaussian Markov random fields.
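The penalty matrix itself is easy to build from a binary adjacency matrix; a toy sketch with a made-up map of four regions:

```r
## Minimal sketch: MRF penalty matrix from a binary adjacency matrix A.
## The four-region adjacency structure is made up for illustration.
A <- matrix(c(0, 1, 1, 0,
              1, 0, 1, 0,
              1, 1, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)

## K = diag(number of neighbours) - A, so that gamma' K gamma sums the
## squared differences (gamma_s - gamma_r)^2 over neighbouring pairs.
K <- diag(rowSums(A)) - A
gamma <- c(0.2, -0.1, 0.4, 0.0)
drop(t(gamma) %*% K %*% gamma)   # quadratic MRF penalty
```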

Inferential Procedures

For each of the three model classes discussed in the following, we will consider three potential avenues for inference:

- Direct optimization of a fit criterion (e.g. maximum likelihood estimation for GAMLSS).
- Bayesian approaches.
- Functional gradient descent boosting.

Functional gradient descent boosting:

- Define the estimation problem in terms of a loss function $\rho$ (e.g. the negative log-likelihood).
- Use the negative gradients of the loss function evaluated at the current fit as a measure of lack of fit.
- Iteratively fit simple base-learning procedures to the negative gradients to update the model fit.
- Componentwise updates of only the best-fitting model component yield automatic variable selection and model choice.
- For semiparametric regression, penalized least squares estimates provide suitable base-learners (see the sketch below).
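A minimal sketch of componentwise boosting with the R package mboost (referenced later in the talk); the data are simulated and only x1 carries signal:

```r
## Minimal sketch: componentwise boosting with P-spline base-learners.
## Simulated data; x2 is pure noise.
library(mboost)

set.seed(3)
n  <- 300
x1 <- runif(n); x2 <- runif(n)
y  <- sin(2 * pi * x1) + rnorm(n, sd = 0.3)
d  <- data.frame(y = y, x1 = x1, x2 = x2)

## bbs() are penalised least squares (P-spline) base-learners;
## in each iteration only the best-fitting component is updated.
fit <- gamboost(y ~ bbs(x1) + bbs(x2), data = d,
                control = boost_control(mstop = 200))
table(selected(fit))   # which base-learners were selected, and how often
```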

Generalized Additive Models for Location, Scale and Shape

GAMLSS provide a unified framework for semiparametric regression models in the case of complex response distributions depending on up to four parameters $(\mu_i, \sigma_i, \nu_i, \xi_i)$, where usually $\mu_i$ is the location parameter, $\sigma_i$ is the scale parameter, and $\nu_i$ and $\xi_i$ are shape parameters determining for example skewness or kurtosis. Each parameter is related to a regression predictor via a suitable response function, i.e. $\mu_i = h_1(\eta_{i,\mu})$, $\sigma_i = h_2(\eta_{i,\sigma})$, etc.

A very broad class of distributions is supported for both discrete and continuous responses. The most important examples for continuous responses:

- Two-parameter normal distribution (location and scale).
- Three-parameter power exponential distribution (location, scale and kurtosis).
- Three-parameter t distribution (location, scale and degrees of freedom).
- Three-parameter gamma distribution (location, scale and shape).
- Four-parameter Box-Cox power distribution (location, scale, skewness and kurtosis).

Direct optimization: for GAMLSS, the likelihood is available due to the explicit assumption made for the distribution of the response. Maximization can be achieved by penalized iteratively weighted least squares (IWLS) estimation. Estimation and choice of the smoothing parameters is challenging, at least for complex models.

Bayesian inference: inference based on Markov chain Monte Carlo (MCMC) simulations is in principle straightforward but requires careful choice of the proposal densities. Promising results have been obtained based on IWLS proposals. Smoothing parameter choice is immediately included.

Boosting: due to the multiple predictors, the usual boosting framework has to be adapted, but it basically still works; see the sketch below.
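A sketch of what this adaptation looks like in practice, using the gamboostLSS package and the simulated location-scale data from the earlier gamlss sketch (illustrative, not the talk's code):

```r
## Minimal sketch: boosting a normal location-scale model (GAMLSS)
## with componentwise updates for both mu and sigma.
library(gamboostLSS)

set.seed(4)
n <- 500
x <- runif(n)
y <- 1 + 2 * x + exp(-1 + 1.5 * x) * rnorm(n)
d <- data.frame(y = y, x = x)

## The same P-spline formula is used for both distribution parameters here.
fit <- gamboostLSS(y ~ bbs(x), families = GaussianLSS(), data = d)
coef(fit$mu)   # fitted base-learner coefficients for the location part
```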

Results for the Munich rental guide obtained with an additive model for location and scale:

[Figure: estimated effects on the mean of area in sqm and year of construction]

[Figure: estimated effects on the standard deviation of area in sqm and year of construction]

Quantile Regression

The theoretical $\tau$-quantile $q_\tau$ for a continuous random variable is characterized by $P(Y \le q_\tau) \ge \tau$ and $P(Y \ge q_\tau) \ge 1 - \tau$. Estimation of quantiles based on i.i.d. samples $y_1, \ldots, y_n$ can be accomplished by

$$\hat q_\tau = \operatorname*{argmin}_q \sum_{i=1}^n w_\tau(y_i, q)\, |y_i - q|$$

with asymmetric weights

$$w_\tau(y_i, q) = \begin{cases} 1 - \tau & y_i < q \\ 0 & y_i = q \\ \tau & y_i > q. \end{cases}$$
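A quick numerical check of this characterization (illustrative sketch, simulated data):

```r
## Minimal sketch: the minimizer of the asymmetrically weighted
## absolute loss coincides with the empirical quantile.
set.seed(5)
y   <- rnorm(1000)
tau <- 0.8

check_loss <- function(q) {
  w <- ifelse(y < q, 1 - tau, ifelse(y > q, tau, 0))
  sum(w * abs(y - q))
}

optimize(check_loss, interval = range(y))$minimum  # close to ...
quantile(y, tau)                                   # ... the empirical 0.8-quantile
```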

[Figure: plot of the weighted losses $w_\tau(y, q)\,|y - q|$ (for $q = 0$)]

Quantile regression starts with the regression model

$$y_i = \eta_{i\tau} + \varepsilon_{i\tau}.$$

Instead of assuming $E(\varepsilon_{i\tau}) = 0$ as in mean regression, we assume

$$F_{\varepsilon_{i\tau}}(0) = P(\varepsilon_{i\tau} \le 0) = \tau,$$

i.e. the $\tau$-quantile of the error is zero. This implies that the predictor coincides with the $\tau$-quantile of the conditional distribution of the response, i.e.

$$F_{y_i}(\eta_{i\tau}) = P(y_i \le \eta_{i\tau}) = \tau.$$

Quantile regression therefore

- is distribution-free, since it does not make any specific assumptions on the type of errors;
- does not even require i.i.d. errors;
- allows for heteroscedasticity.

Note that each parametric regression model also induces a quantile regression model. Example: the heteroscedastic normal model $y \sim N(\eta_1, \exp(\eta_2)^2)$ yields $q_\tau = \eta_1 + \exp(\eta_2)\, z_\tau$.
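In R this is just the normal quantile function (illustrative values):

```r
## Quantiles induced by the heteroscedastic normal model N(eta1, exp(eta2)^2):
eta1 <- 2; eta2 <- -0.5; tau <- 0.9
eta1 + exp(eta2) * qnorm(tau)            # q_tau = eta1 + exp(eta2) * z_tau
qnorm(tau, mean = eta1, sd = exp(eta2))  # the same value
```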

Direct optimisation: classical estimation is achieved by minimizing

$$\sum_{i=1}^n w_\tau(y_i, \eta_{i\tau})\, |y_i - \eta_{i\tau}| + \sum_{j=1}^p \lambda_j \operatorname{pen}(f_j).$$

This can be solved with linear programming as long as the penalties are also linear functionals, e.g. for total variation penalization $\operatorname{pen}(f_j) = \int |f_j''(x)|\, dx$. It does not fit well with the class of quadratic penalties we are considering. Smoothing parameter selection is still challenging, in particular with multiple smoothing parameters.
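The linear-programming approach is what the R package quantreg implements; a minimal illustrative sketch on simulated heteroscedastic data (rqss uses the total variation penalty mentioned above):

```r
## Minimal sketch: linear and total-variation penalised quantile
## regression with quantreg. Simulated data; names are illustrative.
library(quantreg)

set.seed(6)
n <- 400
x <- runif(n, 20, 160)            # e.g. living area
y <- 2 * x + 0.05 * x * rnorm(n)  # spread grows with x
d <- data.frame(y = y, x = x)

fit_lin <- rq(y ~ x, tau = c(0.1, 0.5, 0.9), data = d)        # linear quantile fits
fit_tv  <- rqss(y ~ qss(x, lambda = 5), tau = 0.9, data = d)  # total variation penalty
coef(fit_lin)
```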

Bayesian inference: although quantile regression is distribution-free, there is an auxiliary error distribution that links ML estimation to quantile regression. Assume an asymmetric Laplace distribution for the responses, i.e. $y_i \sim \operatorname{ALD}(\eta_{i\tau}, \sigma^2, \tau)$ with density proportional to

$$\exp\left(-\frac{w_\tau(y_i, \eta_{i\tau})\, |y_i - \eta_{i\tau}|}{\sigma^2}\right).$$

Maximizing the resulting likelihood

$$\exp\left(-\sum_{i=1}^n \frac{w_\tau(y_i, \eta_{i\tau})\, |y_i - \eta_{i\tau}|}{\sigma^2}\right)$$

is equivalent to minimizing the quantile loss criterion.

A computationally attractive way of working with the ALD in a Bayesian framework is its scale-mixture representation: if $z_i \mid \sigma^2 \sim \operatorname{Exp}(1/\sigma^2)$ and $y_i \mid z_i, \eta_{i\tau}, \sigma^2 \sim N(\eta_{i\tau} + \xi z_i, \sigma^2 / w_i)$ with

$$\xi = \frac{1 - 2\tau}{\tau(1 - \tau)}, \qquad w_i = \frac{1}{\delta^2 z_i}, \qquad \delta^2 = \frac{2}{\tau(1 - \tau)},$$

then $y_i$ is marginally $\operatorname{ALD}(\eta_{i\tau}, \sigma^2, \tau)$ distributed. This allows us to construct efficient Gibbs samplers or variational Bayes approximations to explore the posterior after imputing the $z_i$ as additional unknowns.
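A small simulation check of this representation (illustrative sketch); the marginal $\tau$-quantile of the simulated responses should sit at the predictor:

```r
## Minimal sketch: simulate from the ALD scale-mixture representation
## and check that eta is (approximately) the marginal tau-quantile.
set.seed(7)
tau    <- 0.3
eta    <- 1
sigma2 <- 1
xi     <- (1 - 2 * tau) / (tau * (1 - tau))
delta2 <- 2 / (tau * (1 - tau))

n <- 1e5
z <- rexp(n, rate = 1 / sigma2)             # z | sigma^2 ~ Exp(1 / sigma^2)
y <- rnorm(n, mean = eta + xi * z,
           sd = sqrt(sigma2 * delta2 * z))  # Var = sigma^2 / w = sigma^2 delta^2 z

mean(y <= eta)   # should be close to tau = 0.3
```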

Boosting: boosting can be immediately applied in the quantile regression context since it is formulated in terms of a loss function. Negative gradients are defined almost everywhere, i.e. there are no conceptual problems; see the sketch below.
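With mboost this amounts to swapping in the check-function loss (illustrative sketch, simulated data):

```r
## Minimal sketch: quantile boosting via the QuantReg() family.
library(mboost)

set.seed(8)
n <- 400
x <- runif(n)
y <- sin(2 * pi * x) + (0.2 + 0.3 * x) * rnorm(n)
d <- data.frame(y = y, x = x)

## QuantReg() uses the check loss; its negative gradient is defined
## everywhere except at exact zero residuals.
fit <- gamboost(y ~ bbs(x), data = d, family = QuantReg(tau = 0.9))
```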

Results for a geoadditive Bayesian quantile regression model:

[Figure: estimated spatial effects for $\tau = 0.1$, $0.2$, $0.5$ and $0.9$]

[Figure: estimated effects of living area and year of construction for different quantiles]

Expectile Regression

What is expectile regression?

$$\sum_{i=1}^n |y_i - \eta_i| \to \min: \quad \text{median regression}$$

$$\sum_{i=1}^n w_\tau(y_i, \eta_{i\tau})\, |y_i - \eta_{i\tau}| \to \min: \quad \text{quantile regression}$$

$$\sum_{i=1}^n |y_i - \eta_i|^2 \to \min: \quad \text{mean regression}$$

$$\sum_{i=1}^n w_\tau(y_i, \eta_{i\tau})\, |y_i - \eta_{i\tau}|^2 \to \min: \quad \text{expectile regression}$$

Theoretical expectiles are obtained by solving

$$\tau = \frac{\int_{-\infty}^{e_\tau} |y - e_\tau|\, f_y(y)\, dy}{\int_{-\infty}^{\infty} |y - e_\tau|\, f_y(y)\, dy} = \frac{G_y(e_\tau) - e_\tau F_y(e_\tau)}{2\bigl(G_y(e_\tau) - e_\tau F_y(e_\tau)\bigr) + (e_\tau - \mu)}$$

where $f_y(\cdot)$ and $F_y(\cdot)$ denote the density and cumulative distribution function of $y$, $G_y(e) = \int_{-\infty}^{e} y f_y(y)\, dy$ is the partial moment function of $y$, and $G_y(\infty) = \mu$ is the expectation of $y$.

Direct optimization: since the expectile loss is differentiable, estimates for the basis coefficients can be obtained by iterating

$$\hat\gamma_{j\tau}^{[t+1]} = \left(Z_j' W_\tau^{[t]} Z_j + \lambda_j K_j\right)^{-1} Z_j' W_\tau^{[t]} y.$$

A combination with mixed model methodology allows one to estimate the smoothing parameters.
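An illustrative sketch of this iteration in the simplest case of a constant predictor, i.e. the sample expectile (no penalty, $\lambda = 0$):

```r
## Minimal sketch: a sample expectile by iteratively reweighted
## least squares with asymmetric weights.
set.seed(9)
y   <- rnorm(500)
tau <- 0.8

e <- mean(y)                        # start at the mean (the 0.5-expectile)
repeat {
  w <- ifelse(y > e, tau, 1 - tau)  # asymmetric weights at the current fit
  e_new <- sum(w * y) / sum(w)      # weighted least squares update
  if (abs(e_new - e) < 1e-10) break
  e <- e_new
}
e   # the sample tau-expectile
```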

Bayesian inference: similarly as for quantile regression, an asymmetric normal distribution can be defined as an auxiliary distribution for the responses. No scale-mixture representation is known so far. A Bayesian formulation is probably less important, since inference is directly tractable.

Boosting: boosting can be immediately applied in the expectile regression context.

Comparison

Advantages of GAMLSS:

- One joint model for the distribution of the responses.
- Interpretability of the estimated effects in terms of parameters of the response distribution.
- Quantiles (or expectiles) derived from GAMLSS will always be coherent, i.e. their ordering is preserved.
- Readily available in both frequentist and Bayesian formulations.

Disadvantages of GAMLSS:

- Potential for misspecification of the observation model.
- Model checking is difficult in complex settings.
- If quantiles are of ultimate interest, GAMLSS do not provide direct estimates of these.

Advantages of quantile regression:

- Completely distribution-free approach.
- Easy interpretation in terms of conditional quantiles.
- The Bayesian formulation enables very flexible, fully data-driven semiparametric specifications of the predictor.

Disadvantages of quantile regression:

- The Bayesian formulation requires an auxiliary error distribution (which will usually be a misspecification).
- The estimated cumulative distribution function is a step function even for continuous data.
- Additional effort is required to avoid crossing of quantile curves.

Advantages of expectile regression:

- Computationally simple (iteratively weighted least squares).
- Still allows one to characterize the complete conditional distribution of the response.
- Quantiles (or conditional distributions) can be computed from expectiles.
- Expectiles seem to be more efficient than quantiles in close-to-Gaussian situations.
- Expectile crossing seems to be less of an issue than quantile crossing.
- The estimated expectile curve is smooth.

Disadvantages of expectile regression:

- Immediate interpretation of expectiles is difficult.

Summary

There is more than mean regression! Semiparametric extensions are also becoming available for models beyond mean regression. You can do this at home:

- Quantile regression: R package quantreg.
- Bayesian quantile regression: BayesX (MCMC) and a forthcoming R package on variational Bayes approximations (VA).
- GAMLSS: R packages gamlss and gamboostLSS.
- Expectile regression: R package expectreg.

An interesting addition to the models considered: modal regression (yet to be explored).

Acknowledgements

This talk is mostly based on joint work with Nora Fenske, Benjamin Hofner, Torsten Hothorn, Göran Kauermann, Stefan Lang, Andreas Mayr, Matthias Schmid, Linda Schulze Waltrup, Fabian Sobotka, Elisabeth Waldmann and Yu Yue. Financial support has been provided by the German Research Foundation (DFG).

A place called home: http://www.statoek.wiso.uni-goettingen.de