UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Basic Exam - Applied Statistics Thursday, August 30, 2018

Similar documents
Lecture 14: Introduction to Poisson Regression

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Factorial designs. Experiments

Ph.D. Preliminary Examination Statistics June 2, 2014

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Categorical Predictor Variables

Stat 135 Fall 2013 FINAL EXAM December 18, 2013

Problem #1 #2 #3 #4 #5 #6 Total Points /6 /8 /14 /10 /8 /10 /56

Statistics Ph.D. Qualifying Exam: Part I October 18, 2003

Multiple Regression Examples

Chapter 1 Statistical Inference

STAT 461/561- Assignments, Year 2015

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models

First Year Examination Department of Statistics, University of Florida

STA 2201/442 Assignment 2

ST3241 Categorical Data Analysis I Generalized Linear Models. Introduction and Some Examples

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1

Statistics 5100 Spring 2018 Exam 1

Stat 5102 Final Exam May 14, 2015

Statistics II Exercises Chapter 5

Applied Statistics Preliminary Examination Theory of Linear Models August 2017

STAT 525 Fall Final exam. Tuesday December 14, 2010

Master s Written Examination

" M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2

Ph.D. Qualifying Exam Friday Saturday, January 6 7, 2017

Swarthmore Honors Exam 2012: Statistics

STAT 526 Spring Final Exam. Thursday May 5, 2011

ˆπ(x) = exp(ˆα + ˆβ T x) 1 + exp(ˆα + ˆβ T.

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model

UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Basic Exam - Applied Statistics January, 2018

Generalized Linear Models 1

Single-level Models for Binary Responses

Analysis of Variance

The material for categorical data follows Agresti closely.

Generalized Linear Models

Correlation and Regression

First Year Examination Department of Statistics, University of Florida

2.1 Linear regression with matrices

Statistics & Data Sciences: First Year Prelim Exam May 2018

Question. Hypothesis testing. Example. Answer: hypothesis. Test: true or not? Question. Average is not the mean! μ average. Random deviation or not?

One-Way ANOVA. Some examples of when ANOVA would be appropriate include:

Exam Applied Statistical Regression. Good Luck!

UNIVERSITY OF MASSACHUSETTS. Department of Mathematics and Statistics. Basic Exam - Applied Statistics. Tuesday, January 17, 2017

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011

STAT420 Midterm Exam. University of Illinois Urbana-Champaign October 19 (Friday), :00 4:15p. SOLUTIONS (Yellow)

Regression Estimation - Least Squares and Maximum Likelihood. Dr. Frank Wood

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

Ph.D. Qualifying Exam Friday Saturday, January 3 4, 2014

PhD Qualifying Examination Department of Statistics, University of Florida

Correlation and Linear Regression

STA 302f16 Assignment Five 1

Problems. Suppose both models are fitted to the same data. Show that SS Res, A SS Res, B

Mathematical Notation Math Introduction to Applied Statistics

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Lecture 2 Simple Linear Regression STAT 512 Spring 2011 Background Reading KNNL: Chapter 1

Sociology 6Z03 Review II

Simple and Multiple Linear Regression

5. Let W follow a normal distribution with mean of μ and the variance of 1. Then, the pdf of W is

Multiple Linear Regression

Unit 6 - Simple linear regression

MS&E 226: Small Data

Masters Comprehensive Examination Department of Statistics, University of Florida

Sample questions for Fundamentals of Machine Learning 2018

Sociology 593 Exam 2 Answer Key March 28, 2002

Validation of Visual Statistical Inference, with Application to Linear Models

Profile Analysis Multivariate Regression

Linear Regression Models P8111

Comprehensive Examination Quantitative Methods Spring, 2018

Lecture 10 Multiple Linear Regression

Homework 2: Simple Linear Regression

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

Chapter 12 - Lecture 2 Inferences about regression coefficient

Generalized linear models

Outline of GLMs. Definitions

Correlation & Simple Regression

Exam Empirical Methods VU University Amsterdam, Faculty of Exact Sciences h, February 12, 2015

8 Nominal and Ordinal Logistic Regression

Section 4.6 Simple Linear Regression

Inference for Regression

p y (1 p) 1 y, y = 0, 1 p Y (y p) = 0, otherwise.

WISE International Masters

STAT 4385 Topic 03: Simple Linear Regression

LAB 5 INSTRUCTIONS LINEAR REGRESSION AND CORRELATION

MS&E 226: Small Data

UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Applied Statistics Friday, January 15, 2016

ECON Introductory Econometrics. Lecture 2: Review of Statistics

Single Sample Means. SOCY601 Alan Neustadtl

Business Statistics. Lecture 10: Correlation and Linear Regression

STA 431s17 Assignment Eight 1

Statistics 135 Fall 2008 Final Exam

PART I. (a) Describe all the assumptions for a normal error regression model with one predictor variable,

ACOVA and Interactions

Simple Linear Regression: One Qualitative IV

Ch 2: Simple Linear Regression

Generalized Linear Models. Kurt Hornik

1 Appendix A: Matrix Algebra

Simple Linear Regression

Circle the single best answer for each multiple choice question. Your choice should be made clearly.

Transcription:

UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Basic Exam - Applied Statistics Thursday, August 30, 2018 Work all problems. 60 points are needed to pass at the Masters Level and 75 to pass at the Ph.D. level. 1. (23 points) Consider the four pairs plots below. Each is generated as follows: X 1 N(0, 1). Then X 2 is sampled in one of four ways, and Y is generated from X 1 and X 2 in one of four ways. The functions for X 2 and Y are given in the table below. Indicate which set of plots corresponds to each row of the table. Then indicate in which cases: X 1 and X 2 are related X 1 and X 2 are correlated There is an interaction between X 1 and X 2 There is a nonlinear relationship between Y and the X s Then answer the following question. X 2 = E(Y ) = A,B,C or D Related? Correlated? Interact? Nonlinear? (a) 1(X 1i + τ i > 0) X 1 X 1 X 2 τ i N(0, 1) (b) N(0, 1) X 1 X 2 (c) Binom(1,.5) e X 1 X 2 (d) X1i 2 + τ i X 1 X 2 τ i N(0,.3) Note that Y i N(E(Y i ),.3), and 1(K) = { 0 K false 1 K true (e) For which of the above cases would fitting a standard linear model (including interactions but not transformations) be inappropriate? In this case, would fitting a standard linear model of the form log(y ) X 1 + X 2 be appropriate? Why or why not? 1

ya 8 6 4 2 0 x1 8 6 4 2 0 0 1 2 3 4 5 0 1 2 3 4 5 x2a yb 3 x1 3 1.5 0.5 0.5 1.0 1.5 1.5 0.5 0.5 1.0 1.5 x2b yc 0 1 2 3 4 x1 0 1 2 3 4 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 x2c yd 2.5 1.5 0.5 0.5 x1 2.5 1.5 0.5 0.5 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 x2d 2

2. The information below relates y, a second measurement on wood volume, to x 1, a first measurement on wood volume, x 2, the number of trees, x 3, the average age of trees, and x 4, the average volume per tree. Note that x 4 = x 1 /x 2. Some of the information has not been reported, so that you can figure it out on your own. Predictor ˆβk SE( ˆβ k ) t P Constant 23.45 14.90 0.122 x 1 0.93209 0.08602 0.000 x 2 0.4721 1.5554 0.126 x 3-0.4982 0.1520 0.002 x 4 3.486 2.274 0.132 Analysis of Variance Source df SS M S F P Regression 4 887994 0.000 Error Total 54 902773 Source df Sequential SS x 1 1 883880 x 2 1 183 x 3 1 3237 x 4 1 694 (a) (3 points) How many observations are in the data? (b) (3 points) What is R 2 for this model? (c) (3 points) What is the mean squared error? (d) (3 points) Give a 95% confidence interval for β 2. (e) (3 points) Test the null hypothesis β 3 =0 with α =.05 (use a two-sided test). (f) (3 points) Give the F statistic for testing the null hypothesis β 3 =0. (g) (3 points) Set up the test of the model with only variables x 1 and x 2 against the model with all of variables x 1, x 2, x 3, and x 4. Be sure to calculate the test statistic and describe which quantity you would read from a table or from R, including any necessary degrees of freedom. (h) (3 points) Consider testing the model with only variables x 1 and x 2 against the model with variables x 1, x 2, and x 3. Should this test be the same as the one in the previous part? Why or why not? (i) (3 points) For estimating the point on the regression surface at (x 1, x 2, x 3, x 4 ) = (100,25,50,4), the standard error of the estimate for the point on the surface is 2.62. Give the estimated point on the surface, a 95% confidence interval for the point on the surface, and a 95% prediction interval for a new point with these x values. (j) (3 points) Test the null hypothesis β 1 = β 2 = β 3 = β 4 = 0 with α = 0.5. 3

3. Fred recruited 150 subjects from a local shopping mall, all age 18 to 70, asked them their sex, age, and years of education, and had them take a test of reaction time. The following variables were therefore available for each subject: sex age education time 0 = male, 1 = female The subject s age, in years Years of education: eg, 16 = bachelor s degree but no higher Average reaction time in the test, in seconds (lower is better) Pairwise scatterplots of age, education, and time are shown in the figure below, with sex shown by the symbol used, and values for education randomly jittered to avoid overlap. Fred fit a linear regression model for time with the other three variables as covariates, using the standard least squares method (Output 1). Fred thought that more education might indicate greater intelligence, and that greater intelligence and better (ie, lower) reaction times might go together. Fred fit another regression model with only education as a covariate, (Output 2). Fred also fit a regression model with only age as a covariate (Output 3). Answer the following questions about this study: (a) (5 points) Is this an experiment or an observational study? What does your answer say about the ability of this study to answer questions about the causal factors influencing reaction time? Discuss this with reference to the specific variables measured in this study. (b) (5 points) Do you see any unusual data points? What would be an appropriate action to take with regard to any such data points? Does it appear that the regression model fits have been strongly influenced by any such unusual point or points? (c) (5 points) What conclusions do you think Fred should draw from these data? Discuss what can and cannot be concluded from the plots and model fits in output 1, 2, and 3, as well as your answers to (a) and (b). Discuss for each of the covariates (sex, age, and education) whether or not there is good reason to believe that that covariate is associated with time. In particular, discuss what the data show regarding Fred s idea that greater years of education might be associated with lower reaction time. 4

Output 1 Output 2 5

Output 3 6

4. The general form of an over-dispersed exponential family distribution is given by: ( ) b(θ)t (z) A(θ) f Z (z θ, τ) = h(z, τ) exp, (1) d(τ) where z represents the data, θ is the parameter vector of interest, and τ is the dispersion parameter. We will consider this form as a generalization of the Poisson regression framework, in which: λ Y x = E(Y x) = e βt x Y x Poisson(λ Y x ). (2) Note that in the general exponential family expression (1), z represents all of the data, while in the regression expression (2), the data are partitioned into a vector of outcome variables y and a matrix of predictors, x, which are considered fixed. Since z only represents the random variables, you may consider z = y, and treat x as fixed constants. (a) (5 points) Write the distribution for the Poisson regression model in (2) in the form in (1). Give expressions for b(θ), T (y), A(θ), d(τ), and h(y, τ) in terms of y, x, and β. You may do this for a single observation (rather than for a data set of many observations). (b) (3 points) The dispersion parameter allows a distribution to have more or less variance, for fixed mean. Knowing what you know about the Poisson distribution and Poission regression, why might this be helpful? (c) (6 points) Consider the following examples where researchers would like to use Poission regression as in (2). For each of the examples, consider whether you feel Poisson regression is appropriate. Choose the two examples which are most worrisome for using Poisson regression. For each of these, describe your concerns. (i) An environmental scientist counts bird nests in 1-acre plots in many areas of Massachusetts, and models the count of bird nests as a function of features of the local environment. (ii) A meteorologist counts the hours per day of rain (that is, the number of onehour blocks in which there is more than 0 rain, so each day the count is an integer between 0 and 24) in Amherst and each of 6 surrounding towns for each day in 2017. She models the count of hours of rain as a function of daily high and low temperature, as well as features of the towns. (iii) A physiologist uses 3 tests of balance to evaluate 30 research subjects. She models the number of tests passed as a function of individual characteristics. 7

5. Let X 1,..., X n be independent and identically distributed samples according to the Pareto distribution with parameters a and b. Suppose that b is known to be 1. We would like to compare the mean squared error (MSE) of two estimators of a, i.e., the maximum likelihood estimator (MLE) and the moment estimator. Note for a given sample x 1,..., x n, the maximum likelihood estimator â 1 = n/(log x 1 +... + log x n ) and the moment estimator is â 2 = 1/(1 1/ x) with x = n i=1 x i/n. Also note the Pareto distribution is related to the Exponential distribution, that is, if U follows an Exponential distribution with mean 1/a, then X = b exp(u) follows a Pareto distribution with parameters a and b. (a) (10 points) Design a simulation study to compare the MSE of the MLE and the moment estimator of a when b = 1. To answer this part, please provide a conceptual outline of the steps needed for the simulation study. (b) Write R code to implement the simulation. (5 points for workable code, 3 points for elegance and efficiency of the code.) hint: You may wish to use the function rexp(n, rate=[rate]). There is no R function to sample from a Pareto distribution directly. 8