Stat 101: Lecture 6. Summer 2006


Outline: Review and questions. An example for regression. Transformations, extrapolation, and residuals. Review.

Mathematical model for regression. Each point (X_i, Y_i) in the scatterplot satisfies Y_i = a + b X_i + ε_i, where ε_i ~ N(0, sd = σ). σ is usually unknown. The ε's have nothing to do with one another (they are independent); e.g., a big ε_i does not imply a big ε_j. We know the X_i's exactly. This implies that all error occurs in the vertical direction.

Estimating the regression line. e_i = Y_i − (a + b X_i) is called the residual. It measures the vertical distance from a point to the regression line. One estimates â and b̂ by minimizing f(a, b) = Σ_{i=1}^n (Y_i − (a + b X_i))². Taking the derivatives of f(a, b) with respect to a and b and setting them to 0, we get â = Ȳ − b̂ X̄ and b̂ = ( (1/n) Σ_{i=1}^n X_i Y_i − X̄ Ȳ ) / ( (1/n) Σ_{i=1}^n X_i² − X̄² ). f(a, b) is also referred to as the Sum of Squared Errors (SSE).
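
The least-squares formulas above can be checked numerically. Below is a minimal Python sketch (the lecture itself used JMP, so this is purely illustrative); the tiny data set is made up, and np.polyfit serves only as a cross-check.

```python
# Illustrative only: made-up data, not the 62-mammal data from the lecture.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical explanatory values
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])   # hypothetical responses

# Closed-form least-squares estimates, exactly as in the formulas above.
b_hat = (np.mean(X * Y) - X.mean() * Y.mean()) / (np.mean(X ** 2) - X.mean() ** 2)
a_hat = Y.mean() - b_hat * X.mean()

# Sum of Squared Errors at the minimizer (a_hat, b_hat).
sse = np.sum((Y - (a_hat + b_hat * X)) ** 2)

print(a_hat, b_hat, sse)
print(np.polyfit(X, Y, 1))   # returns [slope, intercept]; should match b_hat, a_hat
```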

An example. A biologist wants to predict brain weight from body weight, based on a sample of 62 mammals. A scatterplot of the data is shown on the slide. Ecological correlation?

The regression equation is Y = 90.996 + 0.966 X. The correlation is 0.9344, but it is heavily influenced by a few outliers. The sd of the residuals is 334.721; this represents the typical vertical distance from a point to the regression line. Under the Parameter Estimates portion of the printout, the last column tells whether the intercept and slope are significantly different from 0. Small numbers indicate significant differences; values less than 0.05 are usually taken to indicate real differences from zero, as opposed to chance error.

The root mean squared error (RMSE) is the standard deviation of the vertical distances between each point and the estimated line. It is an estimate of the standard deviation of the vertical distances between the observations and the true line. Formally, RMSE = sqrt[ (1/n) Σ_{i=1}^n (Y_i − (â + b̂ X_i))² ]. Note that â + b̂ X_i is the estimated mean of the Y-values at X_i.
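
As a quick illustration (again with made-up numbers rather than the lecture's data), the RMSE can be computed directly from the fitted line:

```python
import numpy as np

# Hypothetical data; np.polyfit returns [slope, intercept] for a degree-1 fit.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
b_hat, a_hat = np.polyfit(X, Y, 1)

# RMSE: square root of the average squared vertical distance to the fitted line.
rmse = np.sqrt(np.mean((Y - (a_hat + b_hat * X)) ** 2))
print(rmse)
```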

The regression line predicts the average value of Y at a given X. In practice, one often wants to predict the individual value for a particular value of X; e.g., if my weight is 50 kg, then how much would my brain weigh? The prediction (in grams) is Ŷ = â + b̂ X = 90.96 + 0.9665 × 50 = 98.325. But this is just the average for all mammals who weigh as much as I do.

The individual value is less exact than the average value. To predict the average value, the only source of uncertainty is the exact location of the regression line (i.e., â and b̂ are estimates of the true intercept and slope). To predict my brain weight, the uncertainty about my deviation from the average is added to the uncertainty about the location of the line. For example, if I weigh 50 kg, then my brain should weigh 98.325 g + ε. Assuming the regression model is correct, ε has a normal distribution with mean zero and standard deviation 334.721. Note: with this model, my brain could easily have negative weight. This should make us question the regression assumptions.
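
The remark about negative brain weights is easy to quantify. Assuming the predicted mean of 98.325 g and residual sd of 334.721 g quoted above, a rough scipy check (not part of the lecture) gives the implied probability of a negative value:

```python
from scipy.stats import norm

# Normal model implied by the lecture's numbers: mean 98.325 g, sd 334.721 g.
p_negative = norm.cdf(0, loc=98.325, scale=334.721)
print(round(p_negative, 3))   # about 0.38, so a negative brain weight is far from impossible
```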

Transformations. The scatterplot of brain weight against body weight showed that the line was probably controlled by a few large values (high-leverage points). Even worse, the scatterplot did not resemble the football-shaped point cloud that supports the regression assumptions listed before. In cases like this, one can consider transforming the response variable, the explanatory variable, or both. For this data, consider taking the logarithm (base 10) of both the brain weight and the body weight. The scatterplot is then much better.

Taking the log shows that the outliers are not surprising. The regression equation is now log Y = 0.908 + 0.763 log X. Now 91.23% of the variation in brain weight is explained by body weight. Both the intercept and the slope are highly significant. The estimated sd of ε is 0.317; this is the typical vertical distance between a point and the line (on the log scale). Making transformations is an art. Here the analysis suggests that Y = 10^0.908 X^0.763 ≈ 8.1 X^0.763, so there is a power-law relationship between brain mass and body mass.
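
A minimal sketch of the same kind of log-log fit in Python (the body and brain weights below are made up, not the lecture's 62 mammals) shows how the fitted line back-transforms to a power law:

```python
import numpy as np

body = np.array([3.3, 1.04, 27.7, 187.0, 465.0, 521.0])     # hypothetical body weights (kg)
brain = np.array([25.6, 5.5, 115.0, 419.0, 423.0, 655.0])   # hypothetical brain weights (g)

# Fit log10(brain) = intercept + slope * log10(body).
slope, intercept = np.polyfit(np.log10(body), np.log10(brain), 1)

# Back-transform: brain = 10**intercept * body**slope, a power law
# (the lecture's fit gives roughly Y = 8.1 * X**0.763).
print(intercept, slope, 10 ** intercept)
```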

Extrapolation Predicting Y values for X values outside the range of X values observed in the data is called extrapolation. This is risky, because you have no evidence that the linear relationship you have seen in the scatterplot continues to hold in the new X region. Extrapolated values can be entirely wrong. It is unreliable to predict the brain weight of a blue whale or the hog-nosed bat.

Residuals. Estimate the regression line (using JMP software or by calculating â and b̂ by hand). Then find the difference between each observed Y_i and the predicted value Ŷ_i from the fitted line. These differences are called the residuals. Plot each difference against the corresponding X_i value; this plot is called a residual plot.
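
A short matplotlib sketch of this procedure (with made-up data) is below; it fits the line, forms the residuals, and plots them against X:

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # hypothetical data
Y = np.array([2.0, 3.1, 3.9, 5.2, 5.8, 7.1])

b_hat, a_hat = np.polyfit(X, Y, 1)             # slope, intercept
residuals = Y - (a_hat + b_hat * X)            # e_i = Y_i - (a_hat + b_hat * X_i)

plt.scatter(X, residuals)
plt.axhline(0, linestyle="--")                 # horizontal reference line at zero
plt.xlabel("X")
plt.ylabel("Residual")
plt.show()
```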

If the assumptions for linear regression hold, what should one see in the residual plot? If the pattern of the residuals around the horizontal line at zero is: curved, then the assumption of linearity is violated; fan-shaped, then the assumption of constant sd is violated; filled with many outliers, then again the assumption of constant sd is violated; showing a pattern (e.g., positive, negative, positive, negative), then the assumption of independent errors is violated.

When the residuals have a histogram that looks normal and the residual plot shows no pattern, we can use the normal distribution to make inferences about individuals. Suppose we do not make the log transformation. What percentage of 20-kilogram mammals have brains that weigh more than 180 grams? The regression equation says that the mean brain weight for 20-kilogram animals is 90.996 + 0.966 × 20 = 110.33. The sd of the residuals is 334.721. Under the regression assumptions, 20-kilogram mammals have brain weights that are normally distributed with mean 110.33 and sd 334.721. The z-transformation is (180 − 110.33) / 334.72 = 0.208. From the table, the area under the curve to the right of 0.208 is (100 − 15.85) / 2 = 42.075%.
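
The same calculation can be done with scipy instead of the printed normal table; using z exactly rather than rounding to 0.20 gives an upper-tail area of about 41.8%, close to the 42.075% above.

```python
from scipy.stats import norm

z = (180 - 110.33) / 334.721
print(z)                 # about 0.208
print(1 - norm.cdf(z))   # about 0.418, i.e. roughly 42% of 20-kg mammals under this model
```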

Midterm I Instructions. We will have Midterm I on Thursday, July 13th. The exam is 12:30pm - 2:30pm. Do not be late! Office hour: 10:00am - 12:00pm, Wednesday, July 12th, 211 Old Chem. The exam will cover all the material we have discussed so far. The exam is open book, open lecture. You may use a laptop if you wish, and if you choose to type, you must send your answers as an attachment to my email fei@stat.duke.edu by 2:30pm; otherwise, your answers will not be accepted. The questions are expected to be similar to the exercises, review exercises, and Quiz 1. You should be able to finish the exam in 2 hours. When time is up, put your pens and pencils down while I collect the answers; otherwise, you will receive a score of 0.

Designed Experiments and Observational Studies. Double-blind, randomized, controlled studies versus observational studies. Drug-placebo study. Lung cancer and smoking. Association does not imply causation. Confounding factors. Subgroup analysis or weighted averages can help in understanding confounding factors.

Descriptive Statistics. Central tendency: mean, median (quantiles, percentiles), mode. Dispersion: standard deviation, range, IQR. Histograms, boxplots, and scatterplots.

Normal Distributions. Use of the normal table. For a normal distribution, the probability of observing a value within 1 sd of the mean is 68%, within 2 sd is 95%, and within 3 sd is 99.7%. Use of the z-transformation. Always draw pictures.
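
The 68-95-99.7 figures can be checked directly from the standard normal cdf, for example:

```python
from scipy.stats import norm

# P(|Z| < 1), P(|Z| < 2), P(|Z| < 3) for a standard normal Z.
for k in (1, 2, 3):
    print(k, norm.cdf(k) - norm.cdf(-k))   # about 0.683, 0.954, 0.997
```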

Correlation. The correlation r measures the linear association between two variables. Calculate the correlation via the z-transformation. r² is the coefficient of determination: it is the proportion of the variation in Y that is explained by X. No linear association does not imply no association, and association is not causation. Ecological correlations may be misleading.
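
A minimal sketch of computing r by the z-transformation (made-up data; population sds, i.e. dividing by n):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

zx = (X - X.mean()) / X.std()   # np.std divides by n by default (population sd)
zy = (Y - Y.mean()) / Y.std()
r = np.mean(zx * zy)            # average product of z-scores

print(r, r ** 2)                 # r**2 is the coefficient of determination
print(np.corrcoef(X, Y)[0, 1])   # cross-check: same r
```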

Regression. Fit the best line to the data. Regression effect in the test-retest example. The formula for regression is Y_i = a + b X_i + ε_i. We assume ε_i ~ N(0, sd = σ), and the ε's are independent.

Residuals: e_i = Y_i − (a + b X_i). Find the regression line by minimizing the Sum of Squared Errors (SSE), f(a, b) = Σ_{i=1}^n (Y_i − (a + b X_i))². The Least Squares Estimators (LSE) are â = Ȳ − b̂ X̄ and b̂ = ( (1/n) Σ_{i=1}^n X_i Y_i − X̄ Ȳ ) / ( (1/n) Σ_{i=1}^n X_i² − X̄² ). The estimates of the residuals are ê_i = Y_i − (â + b̂ X_i). Data transformations. Extrapolation is risky.