
Applied Statistics
J. Blanchet and J. Wadsworth
Institute of Mathematics, Analysis, and Applications, EPF Lausanne
An MSc Course for Applied Mathematicians, Fall 2012

Outline
1. Motivation: Why Applied Statistics?
2. Descriptive Statistical Measures and Distributions
3. Mean Regression
4. Logistic Regression as an Example of a Generalized Linear Model

Part I Motivation: Why Applied Statistics?

Data are Everywhere! http://www.sciencemag.org/site/special/data/ Science magazine, Feb. 2011: large amounts of data, with uncertainty/variability. We want to learn something from these data, but what?

Why Should I Learn Applied Statistics? To develop statistical methods in medicine. Example: construction of a Probabilistic Brain Atlas (PBA) to detect specific patterns of anatomic alterations due to some disease. PBA of Alzheimer's disease. Source: http://www.loni.ucla.edu/~thompson/prob_atlas.html

Why Should I Learn Applied Statistics? To nowcast and forecast the economy. Nowcasting: assessing the current state of an economy (The Economist, April 2011). [Figure: cycle indicator for the US, 1950-2010, with peaks (P) and troughs (T) marked. Source: de Carvalho et al., 2011.]

Part II Descriptive Statistical Measures and Distributions

Basic notions. We have some data x_1, ..., x_n. We want to learn about the corresponding random variable X, in particular about its cumulative distribution function (CDF), F_X(x) = \Pr(X \le x), which gives \Pr(a < X \le b) = F_X(b) - F_X(a). If X is a continuous random variable, the CDF can be defined in terms of its probability density function (PDF) f_X as

F_X(x) = \int_{-\infty}^{x} f_X(t) \, dt.

Example: the Normal (or Gaussian) distribution. R: dnorm{stats} + pnorm{stats}. The PDF of X \sim N(\mu, \sigma^2) is given by

f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\}, \quad x \in \mathbb{R}.

There is no closed form for the CDF. [Figure: PDF and CDF of the Normal distribution.] \mu: mean, \sigma^2: variance, \sigma: standard deviation. N(0, 1): standard Normal distribution.
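These functions make quick numerical checks easy; a minimal sketch (the values are arbitrary, chosen for illustration):

# Pr(-1 < X <= 2) for X ~ N(0, 1), via the CDF
pnorm(2) - pnorm(-1)
# Density at x = 0 agrees with the formula 1/(sigma * sqrt(2 * pi))
dnorm(0)
1 / sqrt(2 * pi)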

Moments. R: moment + raw2central{moments}. In practice, f_X is unknown. Some statistics can be useful to describe the data. The l-th moment of X:

m_l = E(X^l) = \int x^l f_X(x) \, dx; \quad m_1: mean.

The l-th central moment of X:

\bar{m}_l = E\{(X - m_1)^l\} = \int (x - m_1)^l f_X(x) \, dx.

In practice approximations are needed; e.g. we use \hat{m}_1 = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i to estimate m_1.

Features of a Distribution Described Using Functions of Moments. R: mean{base} + sd{stats} + (skewness + kurtosis){moments}

Measure            | Feature    | Function of moments
Mean               | Location   | \mu = m_1
Standard deviation | Scale      | \sigma = \sqrt{\bar{m}_2}
Skewness           | Asymmetry  | \gamma = \bar{m}_3 / \sigma^3
(Excess) kurtosis  | Tailweight | \kappa = \bar{m}_4 / \sigma^4 - 3
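A minimal sketch of computing these sample summaries in R (assuming the moments package is installed; the data vector is simulated for illustration):

# install.packages("moments")  # if not yet installed
library(moments)
set.seed(1)
x <- rnorm(1000, mean = 5, sd = 2)  # arbitrary illustrative data
mean(x)          # location
sd(x)            # scale
skewness(x)      # asymmetry
kurtosis(x) - 3  # kurtosis() returns m4/sigma^4, so subtract 3 for excess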

Skewness and kurtosis. R: skewness{moments} + kurtosis{moments}. [Figure: densities illustrating negative vs. positive skewness (\gamma < 0 vs. \gamma > 0), and positive, zero, and negative excess kurtosis (\kappa > 0 vs. \kappa = 0 vs. \kappa < 0).]

The previous statistics are useful as a first description of the data, but they do not fully describe it; for that we would need to estimate the PDF f_X. If we have no prior knowledge, this calls for nonparametric estimation (see the lecture in two weeks). In the next slides we are concerned with problems where we want to learn about a certain output variable y_i by using data from another (input) variable x_i. For example, in a class of n students it may be of interest to understand how the output variable, study performance, evolves on average as a function of the input variable, time spent studying. This involves extensions of the concepts above: instead of learning simply about Y, we are interested in Y | X = x.

Part III Mean Regression

Mean Regression: Introducing the linear model. Suppose that we collect a random sample \{(x_i, y_i) : i = 1, ..., n\}. Assume that a linear specification gives a good approximation for the link between the covariates x_i^T = (x_{i1}, ..., x_{ik}) and the response variable y_i, i.e.

y_i = x_i^T \beta + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma_\varepsilon^2), \quad i = 1, ..., n.

Suppose for instance that we want to use the stature of parents (x_i) to predict the stature of their offspring (y_i). A simple linear regression model for these data assumes

y_i = \beta_1 + \beta_2 x_i + \varepsilon_i, \quad i = 1, ..., n.

Mean Regression. Regression towards mediocrity in hereditary stature (Galton, 1886). R: data(galton){UsingR}. Figure: Sir Francis Galton (left); scatterplot and regression line for a hereditary stature study (right). The conception of the model is, however, older, having its roots in Theoria Motus Corporum Coelestium, a work due to the prominent mathematician Carl Gauss.

Mean Regression: What's Fixed? What's Random? In the next slides we follow a frequentist approach, and hence the parameter \beta is assumed to be an unknown but fixed quantity; later we will introduce the Bayesian approach, where this is replaced by the assumption that \beta is a random quantity. We prefer to assume that the covariates x_i are random, though there are other approaches where they are assumed to be fixed. The random-covariate assumption is preferable in many applied fields; in particular, it can be used to accommodate the idea that there may be some measurement errors in the covariates.

OLS (Ordinary Least Squares) Estimator. The OLS estimator (\hat\beta_1, \hat\beta_2)^T defines the line for which the sum of squared vertical distances between the responses (y_i) and the ordinates of that line (\beta_1 + \beta_2 x_i) is minimal across all possible lines, i.e.

\hat\beta = \begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} = \arg\min_{(\beta_1, \beta_2) \in \mathbb{R}^2} \sum_{i=1}^{n} (y_i - \beta_1 - \beta_2 x_i)^2 = \begin{pmatrix} \bar{y} - \hat\beta_2 \bar{x} \\ \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \end{pmatrix}.

Here \bar{x} and \bar{y} denote the sample means of the x_i and y_i. This is the same as the maximum likelihood estimator (more next week), since we assume that y_i | x_i \sim N(\beta_1 + \beta_2 x_i, \sigma_\varepsilon^2).

OLS (Ordinary Least Squares) Estimator. The predicted response is then given by \hat{y}_i = \hat\beta_1 + \hat\beta_2 x_i. This is the estimated expected value of y_i | x_i, since E(\varepsilon_i) = 0. The particular case where \hat\beta_2 = 0 is instructive, as the predicted response is then simply the sample mean, i.e. \hat{y}_i = \bar{y}.

OLS (Ordinary Least Squares) Estimator: Example. In the hereditary stature data we have

\hat\beta = \begin{pmatrix} \bar{y} - \hat\beta_2 \bar{x} \\ \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \end{pmatrix} = \begin{pmatrix} 68.68 - 0.51 \times 67.69 \\ 4171.58 / 8114.44 \end{pmatrix} = \begin{pmatrix} 33.8866 \\ 0.5141 \end{pmatrix}.

The residuals of the fit are the empirical counterparts of the errors of the model, and are defined as e_i = y_i - \hat{y}_i. The assumption of normality of the errors can be tested using, for instance, the Jarque-Bera statistic (R: jarque.bera.test{tseries}).

Fitting a Linear Model for Hereditary Stature Data. R: lm{stats}

# Data are from the UsingR package
# Uncomment the line below to install it
# install.packages("UsingR")
# Load the package, load the data, and attach the data
library(UsingR)
data(father.son)
attach(father.son)
# Mean regression
fit <- lm(sheight ~ fheight)

> fit
Call:
lm(formula = sheight ~ fheight)

Coefficients:
(Intercept)      fheight
    33.8866       0.5141

The fitted model is thus son_height_i = 33.8866 + 0.5141 father_height_i, i = 1, ..., 1078.
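As a sanity check, the closed-form OLS expressions above can be evaluated directly; a minimal sketch (assuming father.son is attached as in the code above):

# By-hand OLS estimates for the father-son data
b2 <- sum((fheight - mean(fheight)) * (sheight - mean(sheight))) /
  sum((fheight - mean(fheight))^2)
b1 <- mean(sheight) - b2 * mean(fheight)
c(intercept = b1, slope = b2)  # should match coef(fit)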

Plotting the Hereditary Stature Data Against the Regression Line

plot(sheight ~ fheight, bty = "l", pch = 20, xlab = "Father's Height",
     ylab = "Son's Height", cex.lab = 1.4)
abline(lm(sheight ~ fheight), lty = 1, lwd = 3, col = "red")

[Figure: scatterplot of son's height against father's height, with the fitted regression line.]

Estimation Uncertainty. Although we assume the parameters \beta = (\beta_1, \beta_2) are fixed, their estimators \hat\beta_1, \hat\beta_2 are functions of the sample y | x, which is random. If we observed another sample of data from y | x, then the parameter estimates would be different. To quantify how different they might be, we look at the distribution of the estimators,

n^{1/2} (\hat\beta - \beta) \to N_2(0, \Sigma) \quad \text{(in distribution)},

where \Sigma is a variance-covariance matrix depending on unknown quantities. In practice, therefore, we estimate \Sigma.

Estimation Uncertainty. The variance-covariance matrix of the OLS estimator is

\Sigma_{\hat\beta} \equiv \mathrm{varcov}(\hat\beta) = \sigma_\varepsilon^2 \begin{pmatrix} \frac{\sum_{i=1}^{n} x_i^2}{n \sum_{i=1}^{n} (x_i - \bar{x})^2} & \frac{-\bar{x}}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\ \frac{-\bar{x}}{\sum_{i=1}^{n} (x_i - \bar{x})^2} & \frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \end{pmatrix}.

In practice, to estimate \Sigma_{\hat\beta} consistently we plug a consistent estimator of \sigma_\varepsilon^2 into the expression above.
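In R the estimated variance-covariance matrix of the coefficients is available directly from the fitted object; a minimal sketch (assuming fit is the lm object from the father-son example):

vcov(fit)              # estimated variance-covariance matrix of (intercept, slope)
sqrt(diag(vcov(fit)))  # standard errors, as reported by summary(fit)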

Estimation Uncertainty. To make inference on the uncertainty in the parameter estimates we use their asymptotic distribution,

t^{(\kappa)} = \frac{\hat\beta_\kappa - \beta_\kappa}{\sqrt{\mathrm{var}(\hat\beta_\kappa)}} \sim N(0, 1), \quad \kappa = 1, 2.

This relies on \mathrm{var}(\hat\beta_\kappa), which is often unknown. Replacing it with an estimate, the appropriate distribution is then

t^{(\kappa)} = \frac{\hat\beta_\kappa - \beta_\kappa}{\sqrt{\widehat{\mathrm{var}}(\hat\beta_\kappa)}} \sim t_{n-2}, \quad \kappa = 1, 2.

(For large n, this is approximately Normal.)

Hypothesis testing. The distributions of the estimators of \beta help us understand which range of values would be plausible if we were to observe another sample. We can use these distributions to test hypotheses about the model. Often of interest is the null hypothesis H_0: \beta_2 = 0, since this is equivalent to the hypothesis that the covariate x does not influence the response y.

Coefficient of Determination. A central question is: does the model explain the variability in the response Y well? The coefficient of determination provides a simple diagnostic of the model fit:

r^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{\text{error variation}}{\text{total variation in } y}.

The smaller the error variation, relative to the total variation in y, the better the fit.

Back to the Hereditary Stature Data: Example. In the hereditary stature data we get

\hat\Sigma_{\hat\beta} = \begin{pmatrix} 3.3587 & -0.0495 \\ -0.0495 & 0.0007 \end{pmatrix},

since \hat\sigma_\varepsilon = \{(n-2)^{-1} \sum_{i=1}^{n} e_i^2\}^{1/2} \approx 2.437.

Hypothesis testing: if our null hypothesis H_0 is that \beta_2 = 0, then |t^{(2)}| = 19.01 > Q_{0.975}(t_{1076}) = 1.96, and hence we reject H_0. The decision to reject the null hypothesis could also have been made by noticing that the p-value is smaller than 0.05, i.e. p-value = 2\{1 - F_{t_{1076}}(|19.01|)\} \approx 2 \times 10^{-16} < 0.05.

Model checking: r^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{2144.58}{8532.58} = 0.2513.
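These quantities can be reproduced from the fitted model; a minimal sketch (again assuming fit is the lm object from before):

# r^2 by hand, matching summary(fit)$r.squared
1 - sum(resid(fit)^2) / sum((sheight - mean(sheight))^2)
# t statistic for the slope, matching summary(fit)
coef(fit)["fheight"] / sqrt(vcov(fit)["fheight", "fheight"])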

Assessing the Fitted Model in R

> fit <- lm(sheight ~ fheight)
> summary(fit)

Call:
lm(formula = sheight ~ fheight)

Residuals:
    Min      1Q  Median      3Q     Max
-8.8772 -1.5144 -0.0079  1.6285  8.9685

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.88660    1.83235   18.49   <2e-16 ***
fheight      0.51409    0.02705   19.01   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.437 on 1076 degrees of freedom
Multiple R-squared: 0.2513, Adjusted R-squared: 0.2506
F-statistic: 361.2 on 1 and 1076 DF,  p-value: < 2.2e-16

Generalizing the Model: More Covariates. Often we believe that more than one covariate may be important. The model is easily generalized as

y_i = \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2).

Inference can still be made in a similar way, but the number of degrees of freedom in the asymptotic distribution changes:

t^{(\kappa)} = \frac{\hat\beta_\kappa - \beta_\kappa}{\sqrt{\widehat{\mathrm{var}}(\hat\beta_\kappa)}} \sim t_{n-k}, \quad \kappa = 1, ..., k.

In this more general setting we can take advantage of a matrix representation of the model.

Matrix Formulation. Consider the following matrices, in which we store the data and the errors of the model:

y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nk} \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}.

The model can then be rewritten as y = X\beta + \varepsilon, and to estimate the regression parameter \beta = (\beta_1, ..., \beta_k)^T we use the OLS estimator, now defined as

\hat\beta = \arg\min_\beta \|y - X\beta\|^2 = (X^T X)^{-1} X^T y,

where \|\cdot\| denotes the L^2 norm. Here the variance-covariance matrix is \Sigma_{\hat\beta} \equiv \mathrm{varcov}(\hat\beta) = \sigma_\varepsilon^2 (X^T X)^{-1}.
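The matrix form can be checked numerically against lm; a minimal sketch (again assuming the father.son data are attached):

# Design matrix with an intercept column
X <- cbind(1, fheight)
beta_hat <- solve(t(X) %*% X, t(X) %*% sheight)  # (X'X)^{-1} X'y
beta_hat  # should match coef(lm(sheight ~ fheight))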

Mean Regression: Theoretical reasons supporting the use of mean regression-based models. The model is conceptually simple and computationally easy to implement. The case of heteroscedasticity (i.e. when \sigma^2 cannot be assumed to be constant over individuals) can lead to some complex problems, but overall the OLS estimator is unbiased, i.e. E(\hat\beta | X) = \beta. Under mild conditions it can be shown to be (strongly) consistent, i.e. \hat\beta \to \beta almost surely as n \to \infty. The Gauss-Markov theorem provides strong theoretical support for using the OLS estimator in a linear regression model (Knight, 2000, p. 417). But...

Mean Regression: Average effects. Something to think about: our linear specification and normality assumption imply that E(y_i | x_i) = x_i^T \beta. This assumes that y_i can take any real value, and that the errors, conditionally upon knowledge of x_i^T \beta, are continuous and normally distributed. But: not all data can take any value, and not all data are continuous. We need to generalize the framework to model different types of response data.

Part IV Logistic Regression as an example of Generalized Linear Model

Introduction. In many contexts the response variable of interest is categorical (e.g. Yes, No). Example: we ask n students how many hours they studied and whether they passed the exam. We want to learn the relationship between the output, passing the exam (the y_i's), and the input, hours of study (the x_i's). The linear regression models discussed above become inadequate. An alternative is provided by logistic regression, which offers the possibility of modelling the probabilities of T classes as functions of covariates, while obeying the constraints that these probabilities lie in [0, 1] and sum to one. For simplicity, we focus here on the case of a single covariate and T = 2, but this can be adapted to a more general setting (Christensen, 1990, Ch. 4; Hastie et al., 2001, Ch. 4).

Logistic Regression. Let p_i denote the probability of failure in the i-th trial. We wish to predict p_i given the information available on a certain covariate x_i. The model is based on the following specification for the log-odds:

\mathrm{logit}(p_i) \equiv \log\left(\frac{p_i}{1 - p_i}\right) = \beta_1 + \beta_2 x_i,

where \beta_1 and \beta_2 respectively denote an intercept and a slope parameter. The quantity p_i / (1 - p_i) is the so-called odds; \mathrm{logit}(\cdot) is the link function.

Maximum likelihood estimation. Inference is typically done by maximum likelihood. Using the fact that

p_i = \frac{\exp(\beta_1 + \beta_2 x_i)}{1 + \exp(\beta_1 + \beta_2 x_i)},

the log-likelihood can be written as

\ell = \sum_{i=1}^{n} \log\{p_i^{I(\text{failure}_i)}\} + \sum_{i=1}^{n} \log\{(1 - p_i)^{I(\text{success}_i)}\}.

In practice this likelihood can be optimized using a Newton-Raphson algorithm (Hastie et al., 2001, p. 99), or a Monte Carlo stochastic optimization method (de Carvalho, 2011). A simple possibility is to code the function \ell in R and use the instruction optim{stats} for the optimization.
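A minimal sketch of that last suggestion, on simulated data (the variable names and simulated values are illustrative, not from the lecture):

# Simulated data: x = hours of study, y = 1 for failure, 0 for success
set.seed(42)
n <- 200
x <- runif(n, 0, 10)
y <- rbinom(n, 1, plogis(2 - 0.5 * x))  # true failure probability via the logit link
# Negative log-likelihood of the logistic model
negloglik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log(1 + exp(eta)))
}
opt <- optim(c(0, 0), negloglik)  # Nelder-Mead by default
opt$par  # compare with coef(glm(y ~ x, family = binomial))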

Example: South African Heart Disease Data. R: glm{stats}

# install.packages("ElemStatLearn")
library(ElemStatLearn)
?SAheart
(...)
Format: (...) A data frame with 462 observations on the following 10 variables.
sbp      systolic blood pressure
tobacco  cumulative tobacco (kg)
ldl      low density lipoprotein cholesterol
famhist  family history of heart disease, a factor with levels Absent, Present
obesity  a numeric vector
alcohol  current alcohol consumption
age      age at onset
chd      response, coronary heart disease

Fitting a Logistic Model on South African Heart Disease Data. R: glm{stats}

# Load and attach the data
> data(SAheart)
> attach(SAheart)
# Logistic regression
fit <- glm(chd ~ sbp + tobacco + ldl + famhist + obesity + alcohol + age,
           family = binomial)
> summary(fit)

Call:
glm(formula = chd ~ sbp + tobacco + ldl + famhist + obesity +
    alcohol + age, family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7517  -0.8378  -0.4552   0.9292   2.4434

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)    -4.1295997  0.9641558  -4.283 1.84e-05 ***
sbp             0.0057607  0.0056326   1.023  0.30643
tobacco         0.0795256  0.0262150   3.034  0.00242 **
ldl             0.1847793  0.0574115   3.219  0.00129 **
famhistPresent  0.9391855  0.2248691   4.177 2.96e-05 ***
obesity        -0.0345434  0.0291053  -1.187  0.23529
alcohol         0.0006065  0.0044550   0.136  0.89171
age             0.0425412  0.0101749   4.181 2.90e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)
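On the logit scale the coefficients are additive; exponentiating turns them into odds multipliers. A minimal sketch (assuming fit is the glm object above):

exp(coef(fit))             # multiplicative effect of a unit increase on the odds of chd
exp(confint.default(fit))  # Wald confidence intervals on the odds scale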

Linear regression vs. logistic regression. R: glm{stats}

Linear regression: y_i = x_i^T \beta + \varepsilon_i, \varepsilon_i iid N(0, \sigma^2). This is equivalent to

Y_i | X = x \sim N(\mu_i(x), \sigma^2), \quad \mu_i(x) = x_i^T \beta.

Logistic regression:

Y_i | X = x \sim \mathrm{Bin}(n_i, p_i(x)), \quad \mathrm{logit}(p_i(x)) = x_i^T \beta.

Both are special cases of the so-called generalized linear model:

Y_i | X = x \sim F, \quad g(E_F(Y | X = x)) = x_i^T \beta,

for a given distribution F (in the exponential family) and a given bijective function g (the link function).

Generalized Linear Model. R: glm{stats}

Example (some examples of generalized linear models): Y_i | X = x \sim F, g(E_F(Y | X = x)) = x_i^T \beta.

Name                     | Distribution F | Link function g | Domain of Y
Linear regression        | Gaussian       | identity        | continuous
Logistic regression      | binomial       | logit           | discrete 0, 1, ..., T
Probit model             | binomial       | probit          | discrete 0, 1, ..., T
Poisson regression       | Poisson        | logarithm       | discrete 0, 1, 2, ...
Neg. binomial regression | Neg. binomial  | logarithm       | discrete 0, 1, 2, ...
Exponential regression   | Exponential    | inverse         | continuous >= 0
Gamma regression         | Gamma          | inverse         | continuous >= 0
Inv. Gaussian regression | Inv. Gaussian  | inverse squared | continuous > 0
...                      | ...            | ...             | ...
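In R, the family argument of glm selects among these models; a minimal sketch with simulated count data (names and values are illustrative only):

# Poisson regression on simulated counts, logarithm link
set.seed(7)
x <- runif(100)
y <- rpois(100, lambda = exp(0.5 + 1.2 * x))
pois_fit <- glm(y ~ x, family = poisson)  # log link is the default for poisson
coef(pois_fit)  # should be close to (0.5, 1.2)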

Take-Home Message. Different regression models for different problems. What could be a suitable model for my problem? What are interesting explanatory variables? (More about this next week.) How much confidence can I put in my analysis? Robustness, power, ... Are the underlying hypotheses of my model fulfilled? Maximum likelihood is a powerful principle of estimation and inference (more about this next week).