STATISTICS 3A03. Applied Regression Analysis with SAS. Angelo J. Canty


Office: Hamilton Hall 209
Phone: (905) 525-9140 extn 27079
E-mail: cantya@mcmaster.ca
SAS Labs: L1 Friday 11:30 in BSB 249; L2 Tuesday 11:30 in BSB 244; L3 Thursday 8:30 in BSB 249
Office Hours: Monday 10:30-11:30, Tuesday 11:30-12:30, Thursday 10:30-11:30, or at other times by e-mail appointment. 1-1

Website: All lecture notes will be available on the website and any announcements about the course will be made there. http://www.math.mcmaster.ca/canty/teaching/stat3a03
Required Text: Regression Analysis By Example (5th Edition), Samprit Chatterjee & Ali S. Hadi, Wiley (2012).
Optional (recommended) Text: Applied Linear Regression (4th Edition), Sanford Weisberg, Wiley (2014). 1-2

Distributions and Moments The distribution of a continuous random variable, Y, is usually described by a Probability Density Function, f_Y(y). Two important characteristics of a distribution are the mean E(Y) and the variance Var(Y). These describe the central location and the amount of spread in the distribution of the random variable Y. Two important rules for calculating moments are 1. E(aY + b) = aE(Y) + b for any constants a, b ∈ ℝ. 2. Var(aY + b) = a²Var(Y) for any constants a, b ∈ ℝ. 1-3
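These two rules can be verified numerically; a minimal Python sketch (the exponential distribution and the constants a, b are arbitrary choices, not from the slides). The identities hold exactly for sample moments too, so the differences printed below are zero up to floating-point error:

```python
import numpy as np

# Check E(aY + b) = aE(Y) + b and Var(aY + b) = a^2 Var(Y) on a sample.
rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=100_000)  # any distribution works here
a, b = 3.0, 5.0
z = a * y + b

print(z.mean() - (a * y.mean() + b))   # essentially 0
print(z.var() - a**2 * y.var())        # essentially 0
```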

The Normal Distribution The Normal or Gaussian distribution is commonly used to model naturally occurring phenomena. It is actually a family of distributions characterized by two parameters, the mean µ and the variance σ². The density function is given by f(y; µ, σ²) = (1/(σ√(2π))) exp{−(y − µ)²/(2σ²)}, −∞ < y < ∞. [Plot: the bell-shaped normal density, centred at µ.] 1-4
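The density formula can be written out directly and checked against a library implementation; a sketch in Python, where µ = 0 and σ² = 4 are arbitrary illustrative values:

```python
import math
from scipy.stats import norm

# The normal density f(y; mu, sigma^2) from the slide, written out by hand.
def normal_pdf(y, mu, sigma2):
    return math.exp(-(y - mu) ** 2 / (2 * sigma2)) / (math.sqrt(sigma2) * math.sqrt(2 * math.pi))

print(normal_pdf(1.0, 0.0, 4.0))          # density of N(0, 4) at y = 1
print(norm.pdf(1.0, loc=0.0, scale=2.0))  # scipy gives the same value (scale is sigma)
```

Note that scipy parameterizes the normal by its standard deviation (`scale`), not the variance.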

Random Vectors Two random variables X and Y have a joint distribution. In the continuous case there is a joint density function f(x, y). The marginal pdf for one variable alone can be found by integrating over the other: f_X(x) = ∫ f(x, y) dy and f_Y(y) = ∫ f(x, y) dx. 1-5

Conditional Distributions We are often interested in the distribution of Y given the additional information that X = x for some value x. This gives rise to the conditional distribution of Y given X = x. The conditional density is f_{Y|X}(y|x) = f(x, y)/f_X(x). In this course we will be particularly interested in modelling the conditional expected value E(Y | X = x) = ∫ y f_{Y|X}(y|x) dy. This will be a function of the value of x used. 1-6
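A conditional mean can be approximated by simulation: draw from a joint distribution and average Y over the draws whose X falls in a narrow bin around x. A minimal sketch using a standardized bivariate normal with correlation rho = 0.6 (an arbitrary choice, not from the slides), for which E(Y | X = x) = rho·x:

```python
import numpy as np

# Monte Carlo approximation of E(Y | X = 1) for a standardized
# bivariate normal pair (X, Y) with correlation rho.
rng = np.random.default_rng(1)
rho = 0.6
x = rng.normal(size=1_000_000)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=1_000_000)

near = np.abs(x - 1.0) < 0.05   # draws with X close to x = 1
print(y[near].mean())           # approximately rho * 1 = 0.6
```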

Random Samples In statistics we deal with random samples. A set of random variables Y_1, ..., Y_n is a random sample if 1. All of the random variables have the same marginal distribution f_Y(y). 2. The set of random variables is independent: f(y_1, ..., y_n) = f_Y(y_1) f_Y(y_2) ⋯ f_Y(y_n). A random sample is also called an independent and identically distributed (iid) set of random variables. 1-7

Moments of Linear Combinations Many important estimators are linear combinations of independent random variables. The moments of these estimators can often be found using the following theorem. Theorem 1: If Y_1, ..., Y_n are independent random variables and a_1, ..., a_n are real constants then 1. E(Σ_{i=1}^n a_i Y_i) = Σ_{i=1}^n a_i E(Y_i) and 2. Var(Σ_{i=1}^n a_i Y_i) = Σ_{i=1}^n a_i² Var(Y_i). 1-8
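Theorem 1 can be checked by Monte Carlo; a sketch for a linear combination of three independent normals (the coefficients, means, and standard deviations below are arbitrary illustrative values):

```python
import numpy as np

# Simulate a1*Y1 + a2*Y2 + a3*Y3 for independent normal Y_i and compare
# the empirical mean and variance with Theorem 1's formulas.
rng = np.random.default_rng(2)
a = np.array([1.0, -2.0, 0.5])
mu = np.array([0.0, 1.0, 3.0])
sd = np.array([1.0, 2.0, 0.5])

samples = rng.normal(mu, sd, size=(1_000_000, 3))
combo = samples @ a   # the linear combination, one value per replication

print(combo.mean())   # near sum(a * mu)       = -0.5
print(combo.var())    # near sum(a**2 * sd**2) = 17.0625
```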

Statistical Inference In statistical inference we have a random sample Y_1, ..., Y_n with common density f(y; θ), where θ is a parameter of the distribution whose value is unknown. We wish to use the sample to make inferences about θ. The three types of inference we make are 1. Estimation of θ. 2. Confidence Intervals for θ. 3. Hypothesis Testing that θ takes on a certain value. 1-9

Inference for Normal Distributions The sample mean Ȳ = (1/n) Σ_{i=1}^n Y_i is the usual estimator for µ. The sample variance S² = (1/(n−1)) Σ_{i=1}^n (Y_i − Ȳ)² is the usual estimator for σ². These are both unbiased estimators in that E(Ȳ) = µ and E(S²) = σ². 1-10
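Computing these two estimators by hand matches the library versions; a sketch (the data values are made up for illustration):

```python
import numpy as np

# Sample mean and sample variance exactly as defined on the slide;
# ddof=1 in numpy gives the same n-1 divisor that makes S^2 unbiased.
y = np.array([5.1, 4.8, 5.6, 5.0, 4.5])
n = len(y)
ybar = y.sum() / n
s2 = ((y - ybar) ** 2).sum() / (n - 1)

print(ybar)   # matches y.mean()
print(s2)     # matches y.var(ddof=1)
```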

Inference for Normal Distributions (ctd) The expectations are taken relative to the distributions of Ȳ and S² when considering repeated sampling from the normal distribution. These are referred to as the Sampling Distributions of the estimators. If Y_1, ..., Y_n are iid N(µ, σ²) then the sampling distribution of Ȳ is Ȳ ~ N(µ, σ²/n). When σ² is known we can use Z = √n(Ȳ − µ)/σ ~ N(0, 1) to make inference. 1-11

Inference for Normal Distributions (ctd) Usually (and always in this course) σ² is unknown. For that reason we use T = √n(Ȳ − µ)/S ~ t_{n−1} for inference. t_{n−1} denotes the Student's t distribution with n − 1 degrees of freedom. It is symmetric about 0 but has heavier tails than the standard normal to account for the variability in S as an estimator of σ. 1-12

Inference for Normal Distributions (ctd) A confidence interval is a random interval which contains the true value of the parameter θ for a large percentage (usually 95%) of samples. A confidence interval for the mean is ȳ ± t_{n−1, α/2} s/√n. In hypothesis testing we specify a null value for the parameter (often, but not always, 0) and see if the data support this value. Evidence against the null hypothesis is usually based on a p-value. If the p-value is small we say that there is evidence against the null hypothesis. 1-13
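The interval ȳ ± t_{n−1, α/2} s/√n can be computed directly from its pieces and cross-checked with scipy; a sketch (the data values are made up):

```python
import numpy as np
from scipy import stats

# 95% t confidence interval for the mean, built from the slide's formula.
y = np.array([5.1, 4.8, 5.6, 5.0, 4.5, 5.3, 4.9])
n = len(y)
ybar, s = y.mean(), y.std(ddof=1)
tcrit = stats.t.ppf(0.975, df=n - 1)      # t_{n-1, alpha/2} with alpha = 0.05
ci = (ybar - tcrit * s / np.sqrt(n), ybar + tcrit * s / np.sqrt(n))

print(ci)
# scipy computes the same interval in one call (sem uses the n-1 divisor):
print(stats.t.interval(0.95, df=n - 1, loc=ybar, scale=stats.sem(y)))
```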

Introduction to Regression The aim of regression is to model the dependence of one variable Y on a set of variables X_1, ..., X_p. Y is called the dependent variable or the response variable. X_1, ..., X_p are called the independent variables or covariates. In this course Y will be a continuous or quantitative variable but the covariates may be continuous or discrete. 1-14

The General Regression Model A general model for predicting Y given the covariates X_1, ..., X_p would be Y = f(X_1, ..., X_p) + ε. The term ε is usually called the Random Error and explains the variability of the random variable Y about f(X_1, ..., X_p). If we specify that E(ε) = 0, Var(ε) = σ² and that ε is independent of the covariates then we see that this implies E(Y | X_1 = x_1, X_2 = x_2, ..., X_p = x_p) = f(x_1, x_2, ..., x_p) and Var(Y | X_1 = x_1, X_2 = x_2, ..., X_p = x_p) = σ². In most modelling situations the form of f will be determined by the analyst and it will typically depend on a set of unknown parameters β_0, β_1, ..., β_p. 1-15

The Linear Regression Model In this course we will deal with the situation where the parameters enter the function f in a linear way. This results in the linear model Y = β_0 + β_1 X_1 + β_2 X_2 + ⋯ + β_p X_p + ε. Furthermore we will generally assume that the random error ε is normally distributed with mean 0 and unknown variance σ² and is independent of the covariates. The aim of regression is then to make inferences about the p + 2 unknown parameters in this model. 1-16
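Fitting such a model by least squares takes only a few lines; a sketch for a single covariate, using simulated data (not course data) with β_0 = 2 and β_1 = 3 as arbitrary true values. The course will do this in SAS; numpy is used here only for illustration:

```python
import numpy as np

# Least-squares fit of Y = b0 + b1*X + eps on simulated data.
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=500)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=500)

X = np.column_stack([np.ones_like(x), x])        # design matrix: intercept column, x
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)   # estimates of (b0, b1), close to (2, 3)
```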

Example: Fish Age and Length The growth pattern of game fish, such as the smallmouth bass, is of interest to agencies who manage stocks in inland lakes. In this example 439 smallmouth bass, each at least one year old, were caught in West Bearskin Lake in Minnesota. For each fish, age was measured using annular rings on the scales, and length was also measured. We wish to fit the model Length = β_0 + β_1 Age + ε. 1-17

Example: Fish Age and Length [Scatterplot: Length (50 to 350) plotted against Age (1 to 8 years) for the sampled fish.] 1-18

Example: Fish Age and Length The red dots are the mean lengths computed separately for the fish of each age. The blue line is the fitted regression line Length = 65.53 + 30.32 Age. The blue line seems to follow the mean lengths very well. Individual fish, however, vary about this line quite a bit. The fitted line tells us that, on average, fish grow 30.32 mm per year after 1 year of age. 1-19

Example: Mother and Daughter Heights How does a mother's height affect her daughter's height? We would expect tall mothers to have tall daughters and short mothers to have short daughters. Do daughters tend to be the same height as their mother, on average? To answer this question, the famous statistician Karl Pearson recorded the heights of 1375 British mothers and their adult daughters in the period 1893-1898. The data were published in 1903 in one of the first publications to look at heredity of physical traits using real data. 1-20

Example: Mother and Daughter Heights [Scatterplot: Daughter's Height plotted against Mother's Height, both axes running from 55 to 70.] 1-21

Example: Mother and Daughter Heights The red line is the line Daughter's Height = Mother's Height. The blue line is the fitted line Daughter's Height = 29.92 + 0.54 × Mother's Height. The fitted line shows that short mothers have shorter than average daughters, but the daughters tend to be taller than their mothers. Conversely, tall mothers have taller than average daughters, but the daughters are shorter than their mothers. This phenomenon is known as Regression to the Mean. 1-22
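Plugging two heights into the fitted line makes regression to the mean concrete; a small sketch (Pearson's heights are recorded in inches):

```python
# The fitted line from the slide: Daughter = 29.92 + 0.54 * Mother.
def daughter_height(mother):
    return 29.92 + 0.54 * mother

# A short mother's predicted daughter is taller than she is;
# a tall mother's predicted daughter is shorter than she is.
print(daughter_height(60.0))  # 62.32, above 60
print(daughter_height(70.0))  # 67.72, below 70
```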

The Regression Process 1. The researcher must clearly define the question(s) of interest in the study. 2. The response variable Y must be decided on, based on the question of interest. 3. A set of potentially relevant covariates, which can be measured, needs to be defined. 4. Data are collected. 1-23

Data Collection In some cases, called designed experiments, the data can be collected in a controlled setting which holds constant the variables that are not of interest. Controlled experiments also facilitate setting the covariates of interest to chosen values. These types of experiments are common in areas such as industrial process control, animal models for medical research, etc. In many studies, however, the data are collected by choosing a random sample of n individuals and observing Y and X_1, ..., X_p for those individuals. 1-24

Data Collection We generally assume that each individual selected is independent of all others. In that case we have a random sample of data which can be organized as

Subject | Y   | X_1  | X_2  | ... | X_p
1       | y_1 | x_11 | x_12 | ... | x_1p
2       | y_2 | x_21 | x_22 | ... | x_2p
...     | ... | ...  | ...  | ... | ...
n       | y_n | x_n1 | x_n2 | ... | x_np

1-25

The Regression Process (continued) 5. Model Specification. What is the form of the model? What assumptions will we make? 6. Decide on a method for fitting the specified model. 7. Fit the model - typically using software such as SAS. 8. Examine the fitted model for violations of assumptions. 9. Conduct hypothesis testing for questions of interest. 10. Report the results from statistical inference. 1-26