Regression. Correlation vs. regression. The parameters of linear regression. Regression assumes... Random sample. Y = α + β X.

Regression. Correlation vs. regression: regression predicts Y from X. Linear regression assumes that the relationship between X and Y can be described by a line.
Regression assumes: a random sample, and that Y is normally distributed with equal variance for all values of X.
The parameters of linear regression: Y = α + βX.
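As a quick illustration of these assumptions, here is a minimal Python sketch (not part of the original slides) that simulates data meeting them: a random sample of X values, with Y normally distributed about the line α + βX with the same variance at every X. The parameter values are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta, sigma = 2.0, 0.5, 1.0        # hypothetical parameter values
x = rng.uniform(0, 10, size=50)           # random sample of X values
# Y is normal around alpha + beta*X, with equal variance for all X
y = alpha + beta * x + rng.normal(0, sigma, size=50)
```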

[Figure: regression lines illustrating positive β, β = 0, and negative β, and lines with higher vs. lower α.]
Estimating a regression line: Y = a + bX.
Nomenclature. Finding the "least squares" regression line. Residual: Y_i − Ŷ_i. Minimize: SS_residual = Σ (Y_i − Ŷ_i)².

Best estimate of the slope:
b = Σ(X_i − X̄)(Y_i − Ȳ) / Σ(X_i − X̄)²  (= "sum of cross products" over "sum of squares of X")
Remember the shortcuts:
Σ(X_i − X̄)(Y_i − Ȳ) = ΣX_iY_i − (ΣX_i)(ΣY_i)/n
Σ(X_i − X̄)² = ΣX_i² − (ΣX_i)²/n
Finding a: since Y = a + bX, a = Ȳ − bX̄.
Example: predicting age based on radioactivity in teeth. Many above-ground nuclear bomb tests in the 50s and 60s may have left a radioactive signal in developing teeth. Is it possible to predict a person's age based on dental C14? Data from 1965 to present, from Spalding et al. 2005. Forensics: age written in teeth by nuclear tests. Nature 437: 333–334.
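A short Python sketch of these formulas (the x and y data here are invented; any paired arrays would do), computing the slope from the sum of cross products over the sum of squares of X, and the intercept from a = Ȳ − bX̄:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

# Shortcut forms of the sum of cross products and sum of squares of X
sum_cross = np.sum(x * y) - np.sum(x) * np.sum(y) / n
sum_sq_x = np.sum(x ** 2) - np.sum(x) ** 2 / n

b = sum_cross / sum_sq_x       # slope
a = y.mean() - b * x.mean()    # intercept
print(b, a)
```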

Teeth data (Δ14C, Date of birth):
89, 1985.5; 109, 1983.5; 91, 1990.5; 17, 1987.5; 99, 1990.5; 110, 1984.5; 13, 1983.5; 105, 1989.5;
6, 1963.5; 6, 1971.7; 471, 1963.7; 11, 1990.5; 85, 1975; 439, 1970.2; 363, 1972.6; 391, 1971.8.
Let X be the estimated age, and Y be the actual age.
ΣX = 3798, ΣY = 31674, ΣX² = 1,340,776, ΣXY = 7,495,223, ΣY² = 62,704,042, n = 16, X̄ = 237.375, Ȳ = 1979.63.
Σ(X_i − X̄)(Y_i − Ȳ) = ΣX_iY_i − (ΣX_i)(ΣY_i)/n = 7,495,223 − (3798)(31674)/16 = −23,393
Σ(X_i − X̄)² = ΣX_i² − (ΣX_i)²/n = 1,340,776 − (3798)²/16 = 439,226
b = −23,393 / 439,226 = −0.053
Calculating a: a = Ȳ − bX̄ = 1979.63 − (−0.053)(237.375) = 1992.2

Ŷ = 1992.2 − 0.053X
Predicting Y from X: if a cadaver has a tooth with Δ14C content equal to 200, what does the regression line predict its year of birth to be?
Ŷ = 1992.2 − 0.053X = 1992.2 − 0.053(200) = 1981.6
r² describes the amount of variance in Y explained by the regression line. r² is the coefficient of determination: it is the square of the correlation coefficient r.

Caution: it is unwise to extrapolate beyond the range of the data.
More data on fish in desert pools: the number of species of fish as predicted by the area of a desert pool. If we were to extrapolate to ask how many species might be in a pool of 50,000 m², we would guess about 20. Log-transformed data.
Testing hypotheses about regression: H0: β = 0; HA: β ≠ 0.

Sums of squares for regression:
SS_Total = ΣY_i² − (ΣY_i)²/n
SS_regression = b Σ(X_i − X̄)(Y_i − Ȳ)
SS_residual + SS_regression = SS_Total, with n − 2 degrees of freedom for the residual.
Radioactive teeth, sums of squares:
SS_Total = ΣY_i² − (ΣY_i)²/n = 1339.75
SS_regression = b Σ(X_i − X̄)(Y_i − Ȳ) = (−0.053)(−23,393) = 1239.8
Calculating the residual mean square:
SS_residual = SS_Total − SS_regression = 1339.75 − 1239.8 = 99.95
df_residual = 16 − 2 = 14
MS_residual = SS_residual / df_residual = 99.95 / 14 = 7.1
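The same decomposition can be sketched in Python (hypothetical data; the variable names mirror the slide's notation):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 4.1, 5.8, 8.2, 9.9, 12.1])
n = len(x)

sum_cross = np.sum(x * y) - np.sum(x) * np.sum(y) / n
sum_sq_x = np.sum(x ** 2) - np.sum(x) ** 2 / n
b = sum_cross / sum_sq_x

ss_total = np.sum(y ** 2) - np.sum(y) ** 2 / n
ss_regression = b * sum_cross
ss_residual = ss_total - ss_regression        # SS_residual + SS_regression = SS_Total
ms_residual = ss_residual / (n - 2)           # n - 2 df for the residual
r_squared = ss_regression / ss_total          # variance in Y explained by the line
```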

Standard error of a slope (b has a t distribution):
SE_b = √(MS_residual / Σ(X_i − X̄)²) = √(7.1 / 439,226) = 0.004
Confidence interval for a slope: b ± t_α(2),df SE_b
Hypothesis tests can use t: t = (b − β0) / SE_b
Example: 95% confidence interval for the slope in the teeth example:
b ± t_0.05(2),14 SE_b = −0.053 ± 2.14(0.004) = −0.053 ± 0.0086
Confidence bands: confidence intervals for predictions of mean Y.
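A sketch of the slope's standard error, confidence interval, and t-test, using scipy for the t quantiles (data invented; residual df = n − 2 as above):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 4.1, 5.8, 8.2, 9.9, 12.1])
n = len(x)

sum_cross = np.sum(x * y) - np.sum(x) * np.sum(y) / n
sum_sq_x = np.sum(x ** 2) - np.sum(x) ** 2 / n
b = sum_cross / sum_sq_x
ss_residual = (np.sum(y ** 2) - np.sum(y) ** 2 / n) - b * sum_cross
ms_residual = ss_residual / (n - 2)

se_b = np.sqrt(ms_residual / sum_sq_x)            # standard error of the slope
t_crit = stats.t.ppf(0.975, df=n - 2)             # two-tailed 95% critical value
ci = (b - t_crit * se_b, b + t_crit * se_b)       # confidence interval for beta
t_stat = (b - 0) / se_b                           # test of H0: beta = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
```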

Prediction intervals: confidence intervals for predictions of individual Y.
Hypothesis tests on slopes: H0: β = 0; HA: β ≠ 0.
t = (b − β0) / SE_b = (−0.053 − 0) / 0.004 = −13.25
t_0.0001(2),14 = ±5.36, so we can reject H0, P < 0.0001.
Non-linear relationships: transformations, quadratic regression, splines.
Transformations:
If Y = aX^b, then ln Y = ln a + b ln X.
If Y = ab^X, then ln Y = ln a + X ln b.
If Y = a + b/X, then set X′ = 1/X and calculate Y = a + bX′.
All of the equations on the right have the form Y = a + bX.
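For example, a power-law relationship Y = aX^b can be fitted by regressing ln Y on ln X; a minimal Python sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 50, 40)
y = 2.0 * x ** 0.7 * rng.lognormal(0.0, 0.1, size=x.size)  # multiplicative noise

# ln Y = ln a + b ln X is a straight line, so fit it with ordinary regression
b, ln_a = np.polyfit(np.log(x), np.log(y), 1)
a = np.exp(ln_a)   # back-transform the intercept to recover a
```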

Non-linear relationship: number of fish species vs. size of desert pool.
Residual plots help assess assumptions. [Figure: original data with its residual plot; log-transformed data with its residual plot.]
Polynomial regression: Number of species = 0.046 + 0.185 Biomass − 0.00044 Biomass²
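A quadratic fit of that form can be sketched with numpy's polyfit (the biomass and species values below are invented, not the data behind the slide's equation):

```python
import numpy as np

biomass = np.array([10, 40, 80, 120, 200, 300, 400, 500], dtype=float)
species = np.array([2, 8, 14, 20, 30, 35, 33, 25], dtype=float)

# Fit species = c0 + c1*biomass + c2*biomass^2 (coefficients returned highest power first)
c2, c1, c0 = np.polyfit(biomass, species, 2)
predicted = c0 + c1 * biomass + c2 * biomass ** 2
```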

Do not fit a polynomial with too many terms (the sample size should be at least 7 times the number of terms).
Comparing two slopes. Example: comparing species-area curves for islands to those of mainland populations. [Figure: Log10(Number of species) by Log10(Area of "island"), with separate linear fits for island and mainland populations.]
Hypotheses: H0: β_M = β_I; HA: β_M ≠ β_I.
The error in the difference of two slopes is normally distributed:
t = [(b_1 − b_2) − (β_1 − β_2)] / SE_{b1−b2}
df = n_1 − 2 + n_2 − 2
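A sketch of this two-slope test in Python, using the pooled residual mean square defined on the next slide (all data invented):

```python
import numpy as np
from scipy import stats

def slope_fit(x, y):
    """Return slope, sum of squares of X, residual SS, and residual df."""
    n = len(x)
    scp = np.sum(x * y) - np.sum(x) * np.sum(y) / n
    ssx = np.sum(x ** 2) - np.sum(x) ** 2 / n
    b = scp / ssx
    ss_err = (np.sum(y ** 2) - np.sum(y) ** 2 / n) - b * scp
    return b, ssx, ss_err, n - 2

x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0]); y1 = np.array([1.0, 1.8, 3.1, 3.9, 5.2, 6.1])
x2 = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0]); y2 = np.array([0.5, 1.9, 3.4, 4.8, 6.3, 7.9])

b1, ssx1, sse1, df1 = slope_fit(x1, y1)
b2, ssx2, sse2, df2 = slope_fit(x2, y2)

ms_pooled = (sse1 + sse2) / (df1 + df2)                 # pooled residual mean square
se_diff = np.sqrt(ms_pooled / ssx1 + ms_pooled / ssx2)  # SE of b1 - b2
t = (b1 - b2) / se_diff                                 # test of H0: beta1 = beta2
p = 2 * stats.t.sf(abs(t), df=df1 + df2)                # df = n1 - 2 + n2 - 2
```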

The pooled error mean square is
(MS_error)_p = [(SS_error)_1 + (SS_error)_2] / [(DF_error)_1 + (DF_error)_2]
and the standard error of the difference between two slopes is
SE_{b1−b2} = √[ (MS_error)_p / Σ(X − X̄)²_1 + (MS_error)_p / Σ(X − X̄)²_2 ]
Analysis of covariance (ANCOVA): compares many slopes.
H0: β_1 = β_2 = β_3 = β_4 = β_5; HA: at least one of the slopes is different from another.
Logistic regression: tests for a relationship between a numerical variable (as the explanatory variable) and a binary variable (as the response). E.g.: Does the dose of a toxin affect the probability of survival? Does the length of a peacock's tail affect its probability of getting a mate?
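A minimal logistic-regression sketch for the toxin-dose example, using statsmodels (the doses and outcomes are invented for illustration):

```python
import numpy as np
import statsmodels.api as sm

dose = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)   # explanatory (numerical)
survived = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])             # response (binary)

# Model the log-odds of survival as a linear function of dose
model = sm.Logit(survived, sm.add_constant(dose)).fit()
print(model.params)   # intercept and slope on the log-odds scale
```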