Applied Statistics. J. Blanchet and J. Wadsworth. Institute of Mathematics, Analysis, and Applications EPF Lausanne

1 Applied Statistics. J. Blanchet and J. Wadsworth, Institute of Mathematics, Analysis, and Applications, EPF Lausanne. An MSc Course for Applied Mathematicians, Fall 2012

2 Outline
1. Motivation: Why Applied Statistics?
2. Descriptive Statistical Measures and Distributions
3. Mean Regression
4. Logistic Regression as an Example of a Generalized Linear Model

3 Part I Motivation: Why Applied Statistics?

4 Data are Everywhere! Science magazine in Feb. 2011 (http://www.sciencemag.org/site/special/data/): there is a large amount of data, and these data carry uncertainty and variability. We want to learn something about these data, but what?

5 Why Should I Learn Applied Statistics? To develop statistical methods in medicine. Example: construction of a Probabilistic Brain Atlas (PBA) to detect specific patterns of anatomic alterations due to some disease. Figure: PBA of Alzheimer's disease. Source:

6 Why Should I Learn Applied Statistics? To nowcast and forecast the economy. Nowcasting: assessing the current state of an economy (The Economist, April 2011). Figure: cycle indicator for the US plotted against time, with points marked P and T. Source: de Carvalho et al., 2011.

7 Part II Descriptive Statistical Measures and Distributions

8 Basic notions
We have some data $x_1, \dots, x_n$ and want to learn about the corresponding random variable $X$, in particular about its cumulative distribution function (CDF),
$$F_X(x) = \Pr(X \le x),$$
which gives $\Pr(a < X \le b) = F_X(b) - F_X(a)$.
If $X$ is a continuous random variable, the CDF can be written in terms of its probability density function (PDF) $f_X$:
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt.$$

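9 Example: the Normal (or Gaussian) distribution
R: dnorm{stats} + pnorm{stats}
The PDF of $X \sim N(\mu, \sigma^2)$ is given by
$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\}, \quad x \in \mathbb{R}.$$
There is no closed form for the CDF.
Figure: PDF and CDF of the Normal distribution.
$\mu$: mean, $\sigma^2$: variance, $\sigma$: standard deviation. $N(0, 1)$: standard Normal distribution.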
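The density formula and the CDF definition can be checked numerically. The sketch below uses assumed values for $\mu$, $\sigma$, $x$, $a$ and $b$ (none of them come from the slides); it compares the closed-form PDF with dnorm and a numerical integral of the PDF with pnorm.

```r
# Assumed example values, chosen only for illustration
mu <- 1; sigma <- 2; x <- 0.5

# PDF: closed-form expression vs. the built-in dnorm()
pdf_formula <- exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
pdf_builtin <- dnorm(x, mean = mu, sd = sigma)

# CDF: no closed form, so integrate the PDF numerically and compare with pnorm()
cdf_numeric <- integrate(dnorm, lower = -Inf, upper = x, mean = mu, sd = sigma)$value
cdf_builtin <- pnorm(x, mean = mu, sd = sigma)

# Pr(a < X <= b) = F_X(b) - F_X(a)
a <- -1; b <- 3
prob_ab <- pnorm(b, mu, sigma) - pnorm(a, mu, sigma)

print(c(pdf_formula = pdf_formula, pdf_builtin = pdf_builtin))
print(c(cdf_numeric = cdf_numeric, cdf_builtin = cdf_builtin))
print(prob_ab)
```

Here $a$ and $b$ sit one standard deviation either side of the mean, so prob_ab is the familiar "about 68%" mass of a Normal distribution.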

10 Moments
R: (moment + raw2central){moments}
In practice, $f_X$ is unknown, but some statistics are useful for describing the data.
The $l$-th moment of $X$ is
$$m_l = E(X^l) = \int_{\mathbb{R}} x^l f_X(x)\,dx; \quad m_1 \text{ is the mean.}$$
The $l$-th central moment of $X$ is
$$m'_l = E\{(X - m_1)^l\} = \int_{\mathbb{R}} (x - m_1)^l f_X(x)\,dx.$$
In practice approximations are needed; e.g. we use the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ to estimate $m_1$.

11 Features of a Distribution Described using Functions of Moments
R: mean{base} + sd{stats} + (skewness + kurtosis){moments}
Measure              Feature       Function of moments
Mean                 Location      $\mu = m_1$
Standard deviation   Scale         $\sigma = \sqrt{m'_2}$
Skewness             Asymmetry     $\gamma = m'_3 / \sigma^3$
(Excess) kurtosis    Tail weight   $\kappa = m'_4 / \sigma^4 - 3$
Here $m'_l$ denotes the $l$-th central moment.
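As an illustration, the four measures can be computed from sample central moments in base R. This is a minimal sketch on simulated right-skewed data; the data, the seed and the helper central_moment are assumptions made for the example, not part of the course material.

```r
set.seed(1)
x <- rexp(1000)   # simulated data (assumption): exponential, hence right-skewed
n <- length(x)

# Sample version of the l-th central moment: average of (x_i - xbar)^l
central_moment <- function(x, l) mean((x - mean(x))^l)

mu_hat    <- mean(x)                               # location
sigma_hat <- sqrt(central_moment(x, 2))            # scale
skew_hat  <- central_moment(x, 3) / sigma_hat^3    # asymmetry
kurt_hat  <- central_moment(x, 4) / sigma_hat^4 - 3  # excess kurtosis (tail weight)

print(c(mean = mu_hat, sd = sigma_hat, skewness = skew_hat, excess_kurtosis = kurt_hat))
```

Note that sd{stats} divides by $n - 1$ rather than $n$, so it differs slightly from sigma_hat in small samples.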

12 Skewness and kurtosis
R: skewness{moments} + kurtosis{moments}
Figure: densities with negative ($\gamma < 0$) vs. positive ($\gamma > 0$) skewness, and with excess kurtosis $\kappa > 0$ vs. $\kappa = 0$ vs. $\kappa < 0$.

13 The previous statistics are useful as a first description of the data, but they do not fully describe it. For that, we would need to estimate the PDF $f_X$. If we have no prior knowledge of its form, we use nonparametric estimation (see the lecture in two weeks). In the next slides, we are concerned with problems where we want to learn about a certain output variable $y_i$ by using data from another (input) variable $x_i$. For example, in a class of $n$ students it may be of interest to understand how the output variable, study performance, evolves on average as a function of the input variable, time spent studying. This involves extensions of the concepts of mean, etc.: instead of learning simply about $Y$, we are interested in $Y \mid X = x$.

14 Part III Mean Regression

15 Mean Regression: Introducing the linear model
Suppose that we collect a random sample $\{(x_i, y_i) : i = 1, \dots, n\}$. Assume that a linear specification gives a good approximation of the link between the covariates $x_i^T = (x_{i1}, \dots, x_{ik})$ and the response variable $y_i$, i.e.
$$y_i = x_i^T \beta + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma_\varepsilon^2), \quad i = 1, \dots, n.$$
Suppose for instance that we want to use the stature of parents ($x_i$) to predict the stature of their offspring ($y_i$). A simple linear regression model for these data assumes
$$y_i = \beta_1 + \beta_2 x_i + \varepsilon_i, \quad i = 1, \dots, n.$$

16 Mean Regression
Regression towards mediocrity in hereditary stature (Galton, 1886)
R: data(galton){UsingR}
Figure: Sir Francis Galton (left); scatterplot and regression line for a hereditary stature study (right).
The conception of the model is, however, older, having its roots in Theoria Motus Corporum Coelestium, a work by the prominent mathematician Carl Gauss.

17 Mean Regression: What's Fixed? What's Random?
In the next slides we follow a frequentist approach, and hence the parameter $\beta$ is assumed to be an unknown but fixed quantity; later we will introduce the Bayesian approach, where this is replaced by the assumption that $\beta$ is a random quantity. We prefer to assume that the covariates $x_i$ are random, although there are other approaches where these are assumed to be fixed. We think that this assumption is preferable in many applied fields; in particular, it can accommodate the idea that there may be measurement errors in the covariates.

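18 OLS (Ordinary Least Squares) Estimator
The OLS estimator $(\hat\beta_1, \hat\beta_2)^T$ defines the line for which the sum of squared vertical distances between the responses ($y_i$) and the ordinates of the line ($\beta_1 + \beta_2 x_i$) is minimal across all possible lines, i.e.
$$\hat\beta = \begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} = \arg\min_{(\beta_1, \beta_2) \in \mathbb{R}^2} \sum_{i=1}^{n} (y_i - \beta_1 - \beta_2 x_i)^2 = \begin{pmatrix} \bar{y} - \hat\beta_2 \bar{x} \\ \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \end{pmatrix}.$$
Here $\bar{x}$ and $\bar{y}$ denote the sample means of the $x_i$ and the $y_i$. This is the same as the maximum likelihood estimator (more next week), since we assume that $y_i \mid x_i \sim N(\beta_1 + \beta_2 x_i, \sigma_\varepsilon^2)$.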
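The closed-form expressions can be checked against lm{stats}. The sketch below uses simulated heights rather than the course data; the sample size, the true coefficients and the noise level are all assumptions chosen for illustration.

```r
set.seed(42)
n <- 200
x <- rnorm(n, mean = 68, sd = 3)       # simulated covariate (assumption)
y <- 34 + 0.5 * x + rnorm(n, sd = 2)   # simulated response with assumed beta = (34, 0.5)

# Closed-form OLS estimates for the simple linear model
b2 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b1 <- mean(y) - b2 * mean(x)                                     # intercept

# Same estimates from lm()
fit <- lm(y ~ x)
print(rbind(closed_form = c(b1, b2), lm = coef(fit)))
```

The two rows agree up to floating-point error, since lm minimizes the same sum of squares.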

19 OLS (Ordinary Least Squares) Estimator
The predicted response is then given by $\hat{y}_i = \hat\beta_1 + \hat\beta_2 x_i$. This is the estimated expected value of $y_i \mid x_i$, since $E(\varepsilon_i) = 0$. The particular case where $\hat\beta_2 = 0$ is instructive, as the predicted response is then simply the sample mean, i.e. $\hat{y}_i = \bar{y}$.

20 OLS (Ordinary Least Squares) Estimator
Example: in the hereditary stature data we have
$$\hat\beta = \begin{pmatrix} \bar{y} - \hat\beta_2 \bar{x} \\ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \big/ \sum_{i=1}^{n} (x_i - \bar{x})^2 \end{pmatrix} = (\ldots).$$
The residuals of the fit are the empirical counterparts of the errors of the model, and are defined as $e_i = y_i - \hat{y}_i$. The assumption of normality of the errors can be tested using, for instance, the Jarque-Bera statistic (R: jarque.bera.test{tseries}).
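To make the normality check concrete, here is a hedged base-R sketch of the Jarque-Bera statistic in its usual form, $JB = \frac{n}{6}\{S^2 + (K - 3)^2 / 4\}$, where $S$ and $K$ are the sample skewness and kurtosis of the residuals and the reference distribution is $\chi^2_2$. The data are simulated with Normal errors (an assumption); the helper m is introduced only for this example.

```r
set.seed(7)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)     # simulated data with Normal errors (assumption)

e <- residuals(lm(y ~ x))     # residuals e_i = y_i - yhat_i

# Sample central moments of the residuals
m <- function(l) mean((e - mean(e))^l)
S <- m(3) / m(2)^1.5          # sample skewness
K <- m(4) / m(2)^2            # sample kurtosis (equals 3 for a Normal distribution)

# Jarque-Bera statistic and its asymptotic chi-squared(2) p-value
JB <- n / 6 * (S^2 + (K - 3)^2 / 4)
p_value <- pchisq(JB, df = 2, lower.tail = FALSE)
print(c(JB = JB, p_value = p_value))
```

A small p-value would cast doubt on the normality assumption; jarque.bera.test{tseries} packages the same kind of test.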

21 Fitting a Linear Model for Hereditary Stature Data
R: lm{stats}
    # Data are from the UsingR package
    # Uncomment the line below to install it
    # install.packages("UsingR")
    # Load package, data, and attach data
    library(UsingR)
    data(father.son)
    attach(father.son)
    # Mean regression
    fit <- lm(sheight ~ fheight)
    > fit
    Call:
    lm(formula = sheight ~ fheight)
    Coefficients:
    (Intercept)      fheight
The fitted model is thus $\widehat{\text{son height}}_i = \hat\beta_1 + \hat\beta_2 \times \text{father height}_i$, $i = 1, \dots, 1078$, where $\hat\beta_1$ and $\hat\beta_2$ are the (Intercept) and fheight coefficients printed by R.

22 Plotting the Hereditary Stature Data Against the Regression Line
    plot(sheight ~ fheight, bty = "l", pch = 20, xlab = "Father's Height",
         ylab = "Son's Height", cex.lab = 1.4)
    abline(lm(sheight ~ fheight), lty = 1, lwd = 3, col = "red")
Figure: scatterplot of son's height against father's height, with the fitted regression line in red.

23 Estimation Uncertainty
Although we assume the parameters $\beta = (\beta_1, \beta_2)$ are fixed, their estimators $\hat\beta_1, \hat\beta_2$ are functions of the sample $y \mid x$, which is random. If we observed another sample of data from $y \mid x$, the parameter estimates would be different. To quantify how different they might be, we look at the distribution of the estimators: for large $n$, approximately,
$$n^{1/2} (\hat\beta - \beta) \overset{\cdot}{\sim} N_2(0, \Sigma),$$
where $\Sigma$ is a variance-covariance matrix depending on unknown quantities. In practice, therefore, we estimate $\Sigma$.

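24 Estimation Uncertainty
The variance-covariance matrix of the OLS estimator is
$$\Sigma_{\hat\beta} \equiv \operatorname{varcov}(\hat\beta) = \sigma_\varepsilon^2 \begin{pmatrix} \dfrac{\sum_{i=1}^{n} x_i^2}{n \sum_{i=1}^{n} (x_i - \bar{x})^2} & -\dfrac{\bar{x}}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\ -\dfrac{\bar{x}}{\sum_{i=1}^{n} (x_i - \bar{x})^2} & \dfrac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \end{pmatrix}.$$
In practice, to consistently estimate $\Sigma_{\hat\beta}$ we plug a consistent estimator of $\sigma_\varepsilon^2$ into the expression above.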
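The plug-in estimate can be sketched in a few lines of R and compared with vcov{stats}. The data below are simulated (an assumption), and $\sigma_\varepsilon^2$ is estimated by the residual sum of squares divided by $n - 2$, which is one common consistent choice.

```r
set.seed(3)
n <- 150
x <- rnorm(n, mean = 68, sd = 3)       # simulated covariate (assumption)
y <- 34 + 0.5 * x + rnorm(n, sd = 2)   # simulated response (assumption)

fit <- lm(y ~ x)
e <- residuals(fit)

# Plug-in estimator of the error variance
sigma2_hat <- sum(e^2) / (n - 2)

# Variance-covariance matrix from the closed-form expression
Sxx <- sum((x - mean(x))^2)
Sigma_hat <- sigma2_hat * matrix(
  c(sum(x^2) / (n * Sxx), -mean(x) / Sxx,
    -mean(x) / Sxx,        1 / Sxx),
  nrow = 2, byrow = TRUE
)

print(Sigma_hat)
print(vcov(fit))   # should agree with Sigma_hat
```

The square roots of the diagonal entries are the standard errors reported by summary(fit).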

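25 Estimation Uncertainty
To make inference on the uncertainty in the parameter estimates we use their asymptotic distribution
$$t(\kappa) = \frac{\hat\beta_\kappa - \beta_\kappa}{\sqrt{\operatorname{var}(\hat\beta_\kappa)}} \overset{\cdot}{\sim} N(0, 1), \quad \kappa = 1, 2.$$
This relies on $\operatorname{var}(\hat\beta_\kappa)$, which is often unknown. Replacing it by its estimate, the appropriate distribution is then
$$t(\kappa) = \frac{\hat\beta_\kappa - \beta_\kappa}{\sqrt{\widehat{\operatorname{var}}(\hat\beta_\kappa)}} \sim t_{n-2}, \quad \kappa = 1, 2.$$
(For large $n$, this is approximately Normal.)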

26 Hypothesis testing
The distributions of the estimators of $\beta$ help us understand which range of values would be plausible if we were to observe another sample. We can use these distributions to test hypotheses about the model. Often of interest is the null hypothesis $H_0: \beta_2 = 0$, since this is equivalent to the hypothesis that the covariate $x$ does not influence the response $y$.
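A minimal sketch of this test on simulated data follows; the sample size and the true slope of 0.3 are assumptions made for illustration. It forms the t statistic for $H_0: \beta_2 = 0$, compares it with the 0.975 quantile of $t_{n-2}$, and computes the two-sided p-value.

```r
set.seed(11)
n <- 100
x <- rnorm(n)
y <- 1 + 0.3 * x + rnorm(n)   # simulated data with an assumed true slope of 0.3

fit <- lm(y ~ x)

# t statistic for H0: beta_2 = 0
b2    <- coef(fit)[["x"]]
se_b2 <- sqrt(vcov(fit)["x", "x"])
t2    <- (b2 - 0) / se_b2

# Two-sided test at the 5% level: critical value and p-value from t_{n-2}
crit    <- qt(0.975, df = n - 2)
p_value <- 2 * (1 - pt(abs(t2), df = n - 2))
reject  <- abs(t2) > crit

print(c(t = t2, critical_value = crit, p_value = p_value))
print(reject)
```

Rejecting when $|t(2)|$ exceeds the critical value is the same decision as rejecting when the p-value is below 0.05.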

27 Coefficient of Determination
A central question is: does the model explain the variability in the response $Y$ well? The coefficient of determination provides a simple diagnostic of the model fit:
$$r^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{\text{error variation}}{\text{total variation in } y}.$$
The smaller the error variation relative to the total variation in $y$, the better the fit.
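Both expressions for $r^2$ can be evaluated directly. The following sketch uses simulated data (an assumption) and compares the two forms with the value reported by summary.lm.

```r
set.seed(21)
n <- 100
x <- rnorm(n)
y <- 2 + 1.5 * x + rnorm(n)   # simulated data (assumption)

fit   <- lm(y ~ x)
y_hat <- fitted(fit)
e     <- residuals(fit)

tss <- sum((y - mean(y))^2)                       # total variation in y
r2_explained <- sum((y_hat - mean(y))^2) / tss    # explained / total
r2_residual  <- 1 - sum(e^2) / tss                # 1 - error / total

print(c(explained = r2_explained, residual = r2_residual,
        summary = summary(fit)$r.squared))
```

The two forms coincide here because the model includes an intercept.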

28 Back to the Hereditary Stature Data
Example: in the hereditary stature data we get
$$\hat\Sigma_{\hat\beta} = (\ldots), \quad \text{since } \hat\sigma_\varepsilon^2 = (n - 2)^{-1} \sum_{i=1}^{n} e_i^2 = \ldots$$
Hypothesis testing: if our null hypothesis $H_0$ is that $\beta_2 = 0$, then $t(2) = \ldots > Q_{0.975}(t_{1076}) = 1.96$, and hence we reject $H_0$. The decision to reject the null hypothesis could also have been reached by noticing that the p-value is smaller than 0.05, i.e. p-value $= 2\{1 - F_{t_{1076}}(\ldots)\} < 0.05$.
Model checking: $r^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \big/ \sum_{i=1}^{n} (y_i - \bar{y})^2 = \ldots$

29 Assessing the Fitted Model in R
    > fit <- lm(sheight ~ fheight)
    > summary(fit)
    Call:
    lm(formula = sheight ~ fheight)
    Residuals:
    Min 1Q Median 3Q Max
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)                               <2e-16 ***
    fheight                                   <2e-16 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    Residual standard error: on 1076 degrees of freedom
    Multiple R-squared: , Adjusted R-squared:
    F-statistic: on 1 and 1076 DF, p-value: < 2.2e-16

30 Generalizing the Model: More Covariates
Often we believe that more than one covariate may be important. The model is easily generalized as
$$y_i = \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2).$$
Inference can still be made in a similar way, but the number of degrees of freedom in the distribution of the test statistic becomes $n - k$:
$$t(\kappa) = \frac{\hat\beta_\kappa - \beta_\kappa}{\sqrt{\widehat{\operatorname{var}}(\hat\beta_\kappa)}} \sim t_{n-k}, \quad \kappa = 1, \dots, k.$$
In this more general setting we can take advantage of a matrix representation of the model.

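31 Matrix Formulation
Consider the following matrices, where we store the data and the errors of the model:
$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nk} \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}.$$
The model can then be rewritten as
$$y = X\beta + \varepsilon,$$
and to estimate the regression parameter $\beta = (\beta_1, \dots, \beta_k)^T$ we use the OLS estimator, now defined as
$$\hat\beta = \arg\min_{\beta} \|y - X\beta\|^2 = (X^T X)^{-1} X^T y,$$
where $\|\cdot\|$ denotes the $L_2$ norm. Here the variance-covariance matrix is
$$\Sigma_{\hat\beta} \equiv \operatorname{varcov}(\hat\beta) = \sigma_\varepsilon^2 (X^T X)^{-1}.$$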
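The matrix formulas translate almost line by line into R. The sketch below builds an assumed design matrix with an intercept column and two simulated covariates, solves the normal equations, and compares the result with lm; all numerical values are illustrative assumptions.

```r
set.seed(5)
n <- 120; k <- 3
X <- cbind(1, rnorm(n), runif(n))    # assumed design: intercept + 2 simulated covariates
beta <- c(2, -1, 0.5)                # assumed true coefficients
y <- drop(X %*% beta + rnorm(n, sd = 0.7))

# OLS estimator: solve (X'X) b = X'y rather than inverting X'X explicitly
XtX <- crossprod(X)
beta_hat <- solve(XtX, crossprod(X, y))

# Estimated variance-covariance matrix: sigma2_hat * (X'X)^{-1}
e <- y - X %*% beta_hat
sigma2_hat <- sum(e^2) / (n - k)
Sigma_hat <- sigma2_hat * solve(XtX)

# Same fit through lm(); "- 1" because X already contains the intercept column
fit <- lm(y ~ X - 1)
print(cbind(matrix_form = drop(beta_hat), lm = coef(fit)))
```

Solving the linear system is usually preferred to computing $(X^T X)^{-1}$ and multiplying, although the inverse is still needed for the variance-covariance matrix.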

32 Mean Regression
Theoretical reasons supporting the use of mean regression-based models:
The model is conceptually simple and computationally easy to implement.
Heteroscedasticity (i.e. when $\sigma^2$ cannot be assumed constant across individuals) can lead to some complex problems, but the OLS estimator remains unbiased, i.e. $E(\hat\beta \mid X) = \beta$.
Under mild conditions it can be shown to be (strongly) consistent, i.e. $\hat\beta \to \beta$ almost surely as $n \to \infty$.
The Gauss-Markov theorem provides strong theoretical support for using the OLS estimator in a linear regression model (Knight, 2000, p. 417).
But...

33 Mean Regression
Average effects. Something to think about: our linear specification and normality assumption imply that $E(y_i \mid x_i) = x_i^T \beta$. This assumes that $y_i$ can take any real value, and that the errors, conditionally on $x_i^T \beta$, are continuous and normally distributed. But not all data can take any value, and not all data are continuous. We need to generalize the framework to model different types of response data.

34 Part IV Logistic Regression as an Example of a Generalized Linear Model

35 Introduction
In many contexts the response variable of interest is categorical (e.g. Yes, No). Example: we ask $n$ students how many hours they studied and whether they passed the exam. We want to learn the relationship between the output, passing the exam (the $y_i$'s), and the input, hours of study (the $x_i$'s). The linear regression models discussed so far are inadequate here. An alternative is provided by logistic regression, which models the probabilities of $T$ classes as functions of covariates while obeying the constraints that these probabilities lie in $[0, 1]$ and sum to one. For simplicity, we focus here on the case of a single covariate and $T = 2$, but this can be adapted to a more general setting (Christensen, 1990, §4; Hastie et al., 2001, §4).

36 Logistic Regression
Let $p_i$ denote the probability of failure in the $i$th trial. We wish to predict $p_i$ given the information available on a certain covariate $x_i$. The model is based on the following log-linear specification:
$$\operatorname{logit}(p_i) \equiv \log\left(\frac{p_i}{1 - p_i}\right) = \beta_1 + \beta_2 x_i,$$
where $\beta_1$ and $\beta_2$ respectively denote an intercept and a slope parameter. The quantity $p_i / (1 - p_i)$ is the so-called odds (the ratio of $p_i$ to $1 - p_i$). $\operatorname{logit}(\cdot)$ is the link function.
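The link function and its inverse are easy to write down. In this sketch the coefficient values and covariate values are hypothetical, and the helpers logit and inv_logit are defined only for the example (base R offers qlogis and plogis for the same purpose).

```r
# Link function and its inverse (hypothetical helper names)
logit     <- function(p) log(p / (1 - p))
inv_logit <- function(eta) exp(eta) / (1 + exp(eta))

beta1 <- -2; beta2 <- 0.8   # hypothetical intercept and slope
x <- c(0, 1, 2.5, 5)        # hypothetical covariate values

eta  <- beta1 + beta2 * x   # linear predictor, can be any real number
p    <- inv_logit(eta)      # probabilities, always strictly between 0 and 1
odds <- p / (1 - p)         # odds, equal to exp(eta)

print(data.frame(x = x, eta = eta, p = p, odds = odds))
```

Whatever real value the linear predictor takes, the implied probability respects the $[0, 1]$ constraint, which is what makes this link suitable for binary responses.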

37 Maximum likelihood estimation
Inference is typically done by maximum likelihood. Using the fact that
$$p_i = \frac{\exp(\beta_1 + \beta_2 x_i)}{1 + \exp(\beta_1 + \beta_2 x_i)},$$
the log-likelihood can be written as
$$l = \sum_{i=1}^{n} I(\text{failure}_i) \log p_i + \sum_{i=1}^{n} I(\text{success}_i) \log(1 - p_i).$$
In practice this likelihood can be optimized using a Newton-Raphson algorithm (Hastie et al., 2001, p. 99), or a Monte Carlo stochastic optimization method (de Carvalho, 2011). A simple possibility is to code the function $l$ in R and use optim{stats} for the optimization.
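The "code $l$ and call optim" route might look like the following sketch. The data are simulated, the true coefficients are assumptions, and y = 1 is used to mark the event whose probability is $p_i$; the result is compared with glm{stats}, which fits the same model by iteratively reweighted least squares.

```r
set.seed(2024)
n <- 400
x <- runif(n, 0, 10)                              # simulated covariate (assumption)
y <- rbinom(n, size = 1, prob = plogis(-3 + 0.6 * x))  # y = 1 marks the event with probability p_i

# Negative log-likelihood; uses log p_i = eta_i - log(1 + exp(eta_i))
# and log(1 - p_i) = -log(1 + exp(eta_i)), so l = sum(y * eta - log(1 + exp(eta)))
negloglik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log1p(exp(eta)))
}

# optim() minimizes, so we pass the negative log-likelihood
opt <- optim(par = c(0, 0), fn = negloglik, method = "BFGS")

# Reference fit
fit <- glm(y ~ x, family = binomial)
print(rbind(optim = opt$par, glm = coef(fit)))
```

The two sets of estimates agree only up to the optimizer's tolerance, since optim here relies on numerical gradients.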

38 Example: South African Heart Disease Data
R: glm{stats}
    # install.packages("ElemStatLearn")
    library(ElemStatLearn)
    ?SAheart
    (...)
    Format:
    (...)
    A data frame with 462 observations on the following 10 variables.
    sbp       systolic blood pressure
    tobacco   cumulative tobacco (kg)
    ldl       low density lipoprotein cholesterol
    famhist   family history of heart disease, a factor with levels Absent Present
    obesity   a numeric vector
    alcohol   current alcohol consumption
    age       age at onset
    chd       response, coronary heart disease

39 Fitting a Logistic Model on South African Heart Disease Data
R: glm{stats}
    # Load and attach the data
    > data(SAheart)
    > attach(SAheart)
    # Logistic regression
    fit <- glm(chd ~ sbp + tobacco + ldl + famhist + obesity + alcohol + age,
               family = binomial)
    > summary(fit)
    Call:
    glm(formula = chd ~ sbp + tobacco + ldl + famhist + obesity + alcohol + age, family = binomial)
    Deviance Residuals:
    Min 1Q Median 3Q Max
    Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
    (Intercept)                                    e-05 ***
    sbp
    tobacco                                             **
    ldl                                                 **
    famhistPresent                                 e-05 ***
    obesity
    alcohol
    age                                            e-05 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    (Dispersion parameter for binomial family taken to be 1)

40 Linear regression vs. logistic regression
R: glm{stats}
Linear regression: $y_i = x_i^T \beta + \varepsilon_i$, with $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$. This is equivalent to
$$Y_i \mid X = x \sim N(\mu_i(x), \sigma^2), \quad \mu_i(x) = x_i^T \beta.$$
Logistic regression:
$$Y_i \mid X = x \sim \operatorname{Bin}(n_i, p_i(x)), \quad \operatorname{logit}(p_i(x)) = x_i^T \beta.$$
Both are special cases of the so-called generalized linear model:
$$Y_i \mid X = x \sim F, \quad g(E_F(Y \mid X = x)) = x_i^T \beta,$$
for a given distribution $F$ (in the exponential family) and a given bijective function $g$ (the link function).
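The common framework is visible in the glm{stats} interface, where the family and link are the only things that change. The sketch below, on simulated data (an assumption), checks that a Gaussian GLM with identity link reproduces the lm fit, and that applying the logit link to the fitted probabilities of a binomial GLM recovers the linear predictor.

```r
set.seed(9)
n <- 80
x <- rnorm(n)

# Linear regression as a GLM: Gaussian family, identity link
y <- 1 + 2 * x + rnorm(n)   # simulated continuous response (assumption)
fit_lm  <- lm(y ~ x)
fit_glm <- glm(y ~ x, family = gaussian(link = "identity"))
print(rbind(lm = coef(fit_lm), glm = coef(fit_glm)))

# Logistic regression: binomial family, logit link
yb <- rbinom(n, size = 1, prob = plogis(0.5 + x))   # simulated binary response (assumption)
fit_logit <- glm(yb ~ x, family = binomial(link = "logit"))

# g(E(Y | x)) = x' beta: the logit of the fitted mean is the linear predictor
mu_hat  <- fitted(fit_logit)
eta_hat <- predict(fit_logit, type = "link")
print(head(cbind(logit_of_mean = qlogis(mu_hat), linear_predictor = eta_hat)))
```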

41 Generalized Linear Model
R: glm{stats}
Example (some examples of generalized linear models): $Y_i \mid X = x \sim F$, $g(E_F(Y \mid X = x)) = x_i^T \beta$.
Name                       Distribution F   Link function g   Domain of Y
Linear regression          Gaussian         identity          continuous
Logistic regression        binomial         logit             discrete, 0, 1, ..., T
Probit model               binomial         probit            discrete, 0, 1, ..., T
Poisson regression         Poisson          logarithm         discrete, 0, 1, 2, ...
Neg. binomial regression   Neg. binomial    logarithm         discrete, 0, 1, 2, ...
Exponential regression     Exponential      inverse           continuous, >= 0
Gamma regression           Gamma            inverse           continuous, >= 0
Inv. Gaussian regression   Inv. Gaussian    inverse squared   continuous, > 0

42 Take-Home Message
Different regression models suit different problems. What could be a suitable model for my problem? What are interesting explanatory variables? (More about this next week...) How much confidence can I put in my analysis? Robustness, power, ... Are the underlying hypotheses of my model fulfilled? Maximum likelihood is a powerful principle of estimation and inference (more about this next week...).
