Simple linear regression: linear relationship between two quantitative variables. Linear Regression. The regression line


Linear Regression
Simple linear regression: linear relationship between two quantitative variables
The regression line
Facts about least-squares regression
Residuals
Influential observations
Cautions about Correlation and Regression
Correlation/regression using averages
Lurking variables
Association is not causation

Correlation tells us about the strength (scatter) and direction of the linear relationship between two quantitative variables. In addition, we would like to have a numerical description of how both variables vary together. For instance, is one variable increasing faster than the other one? And we would like to make predictions based on that numerical description. But which line best describes our data? Here we will ask Minitab to decide.

The linear regression line
A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x. In regression, the distinction between explanatory and response variables is important.

The equation of a straight line (review)

A statistical model
Simple linear regression model:
Y = β0 + β1·X + ε,  so that E{Y} = β0 + β1·X (the slope of the line is β1), where
Y = dependent variable
X = independent variable
β0 = y-intercept
β1 = slope of the line
ε = error variable, normally distributed about 0 (mean is zero) with constant standard deviation. The error variable accounts for all the variables, both measurable and unmeasurable, that are not part of the model.
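As a quick illustration of this model, data could be simulated from Y = β0 + β1·X + ε. This is a minimal sketch, not from the lecture: Python with numpy is assumed, and the values chosen for β0, β1 and the error standard deviation are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameter values (assumptions, not from the lecture)
beta0, beta1, sigma = 1.0, 16.5, 5.0

x = np.arange(1, 7, dtype=float)            # independent variable X
eps = rng.normal(0.0, sigma, size=x.size)   # error term: mean 0, constant sd
y = beta0 + beta1 * x + eps                 # response generated by the model

print(np.column_stack((x, y.round(1))))     # (x, y) pairs scattered around the true line
```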

Estimating the regression line from data
The least-squares regression (minsta-kvadrat regression) line is the unique line such that the sum of the squared vertical (y) distances between the data points and the line is the smallest possible. The distances between the observed points and the line are called the residuals of the model for the data. Residuals can be both positive and negative.

Estimation of β0 and β1
β0 and β1 are unknown and are estimated (by Minitab) from the data by least-squares regression as b0 and b1 respectively. We get the estimated model ŷ = b0 + b1·x. That is, ŷ (y-hat) is the predicted response for any x, y is the observed value, b1 is the slope and b0 is the intercept.

An example
For 6 randomly selected students the number of study hours for the exam and the exam marks are recorded:

student   hours (x)   marks (y)
1         1           17
2         2           32
3         3           58
4         4           60
5         5           87
6         6           99
mean      3.5         58.83
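A minimal sketch of the least-squares computation, assuming Python with numpy and using the hours/marks data above (values as reconstructed): it applies the standard formulas b1 = Sxy / Sxx and b0 = ȳ − b1·x̄.

```python
import numpy as np

# Hours/marks data from the example above
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)       # study hours
y = np.array([17, 32, 58, 60, 87, 99], dtype=float)  # exam marks

xbar, ybar = x.mean(), y.mean()
Sxy = np.sum((x - xbar) * (y - ybar))   # sum of cross-products
Sxx = np.sum((x - xbar) ** 2)           # sum of squares of x

b1 = Sxy / Sxx          # estimated slope
b0 = ybar - b1 * xbar   # estimated intercept

print(f"b1 = {b1:.3f}, b0 = {b0:.3f}")  # roughly 16.486 and 1.133
```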

Example...
Minitab output:

Regression Analysis: Marks versus Hours

The regression equation is
Marks = 1,13 + 16,5 Hours

Predictor   Coef     SE Coef   T       P
Constant    1,133    5,156     0,22    0,837
Hours       16,486   1,324     12,45   0,000

S = 5,5386   R-Sq = 97,5%   R-Sq(adj) = 96,9%

Interpretations:
b0 = 1.13: when a student spends no hours studying, he/she gets 1.13 marks.
b1 = 16.5: for each additional hour of study, a student's mark increases by 16.5.

Another example: ads and revenue
In order to see the relationship between the number of advertisements on local radio (ads) and revenue (in 1000s of SEK) during a month, the manager of a fast food company recorded both for several months. [Monthly data table not reproduced.] The fitted model is

Expected revenue = 4.852 + 2.473 · ads

b1 = 2.473: when the number of advertisements increases by one, the monthly revenue increases by 2.473 SEK (in 1000s).
b0 = 4.852: when no advertising is done on local radio, the monthly revenue is 4.852 SEK (in 1000s).
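The same fit can be reproduced outside Minitab. A minimal sketch assuming Python with scipy, applied to the hours/marks data above, gives essentially the same coefficients, standard error and R-Sq as the Minitab output.

```python
from scipy.stats import linregress

hours = [1, 2, 3, 4, 5, 6]
marks = [17, 32, 58, 60, 87, 99]

res = linregress(hours, marks)   # simple linear regression of marks on hours

print(f"intercept = {res.intercept:.3f}")   # ~1.13
print(f"slope     = {res.slope:.3f}")       # ~16.49
print(f"SE(slope) = {res.stderr:.3f}")      # ~1.32
print(f"R-squared = {res.rvalue**2:.3f}")   # ~0.975
print(f"p-value   = {res.pvalue:.4f}")      # test of zero slope
```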

Coefficient of determination, r²
r², the coefficient of determination, is the square of the correlation coefficient. It represents the percentage of the variance in y (the vertical scatter from the regression line) that can be explained by changes in x:

r² = s²ŷ / s²y  (the variance of the predicted values divided by the variance of the observed values).
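A small check of this, assuming Python with numpy and the hours/marks data from before: r² computed as the squared correlation coefficient agrees with the share of the variance in y carried by the fitted values ŷ.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([17, 32, 58, 60, 87, 99], dtype=float)

r = np.corrcoef(x, y)[0, 1]     # correlation coefficient

# Fitted values from the least-squares line
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

print(r ** 2)                    # ~0.975
print(np.var(y_hat) / np.var(y)) # same value: explained share of the variance in y
```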

[Three scatterplots illustrating r²:]
r = −1, r² = 1: changes in x explain 100% of the variation in y; y can be entirely predicted for any given value of x.
r = 0, r² = 0: changes in x explain 0% of the variation in y; the value(s) y takes are entirely independent of what value x takes.
r = 0.87, r² = 0.76: here the change in x only explains 76% of the change in y. The rest of the change in y (the vertical scatter, shown as red arrows) must be explained by something other than x.

Hours and marks example: coefficient of determination
S = 5,5386   R-Sq = 97,5%   R-Sq(adj) = 96,9%
97.5% of the variance of marks is explained by hours.

Facts about least-squares regression

Extrapolation
!!! Extrapolation is the use of a regression line for predictions outside the range of x values used to obtain the line. !!! This can be a very stupid thing to do, as seen here.

The intercept
Sometimes the y-intercept is not biologically possible. Here we have a negative blood alcohol content, which makes no sense. But the negative value is appropriate for the equation of the regression line: the y-intercept shows negative blood alcohol. There is a lot of scatter in the data, and the line is just an estimate.

Residuals
The distances from each point to the least-squares regression line give us potentially useful information about the contribution of individual data points to the overall pattern of scatter. These distances are called residuals.
Points above the line have a positive residual; points below the line have a negative residual. The sum of these residuals is always 0.
Observed y, predicted ŷ: the vertical distance (y − ŷ) is the residual.
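A minimal sketch, assuming Python with numpy and the hours/marks data above, computing the residuals y − ŷ and confirming that they sum to (essentially) zero:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([17, 32, 58, 60, 87, 99], dtype=float)

# Least-squares fit (same estimates as before)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)   # observed minus predicted

print(residuals.round(2))       # a mix of positive and negative values
print(residuals.sum())          # essentially 0 (up to floating-point rounding)
```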

Residual plots
Residuals randomly scattered: good! A curved pattern means the relationship you are looking at is not linear. A change in variability across the plot is a warning sign. You need to find out why it is there, and remember that predictions made in areas of larger variability will not be as good.

Regression diagnostics: ads and revenue
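A residual plot of this kind could be drawn, for example, with matplotlib; this is a minimal sketch for the hours/marks data above, and the library choice is an assumption, not part of the lecture.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([17, 32, 58, 60, 87, 99], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
residuals = y - y_hat

plt.scatter(y_hat, residuals)      # residuals against predicted values
plt.axhline(0, linestyle="--")     # reference line at zero
plt.xlabel("Predicted marks")
plt.ylabel("Residual")
plt.title("Residual plot: look for random scatter")
plt.show()
```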

Assessing the model
The least-squares method produces a regression line whether or not there is a linear relationship between X and Y. Consequently, it is important to assess how well the linear model fits the data.
Check if we have a reasonably high R-square value.
Check if the regression assumptions are fulfilled by examining the residuals of the model:
1. Do the residuals have constant variation for all predicted values? Draw a scatter plot of residuals against predicted values to see if the spread is the same.
2. Are the residuals independent of each other? Draw a scatter plot of residuals against predicted values to see if they follow a regular pattern.
3. Are the residuals normal with mean about zero? Draw a histogram of the residuals and a normal P-P plot.

Always plot your data
A correlation coefficient and a regression line can be calculated for any relationship between two quantitative variables. However, outliers can greatly influence the results. Also, running a linear regression on a nonlinear association is not only meaningless but misleading.
[Data table: x, y and log(y); Y is transformed into log(Y) to get a linear relationship.]
So, make sure to always plot your data before you run a correlation or regression analysis. Making the scatterplots shows us that correlation/regression analysis is not appropriate for all data sets:
Moderate linear association; regression OK.
Obvious nonlinear relationship; regression not OK.
One point deviates from the highly linear pattern; this outlier must be examined closely before proceeding.
Just one very influential point; all other points have the same x value; a redesign is due here.
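To illustrate the transformation idea (the slide's own data values are not legible here), a minimal sketch with made-up exponential-growth data shows how taking log(y) straightens the relationship so that a line can be fitted; Python with numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up exponential-growth data (illustrative only, not the slide's values)
x = np.arange(1, 11, dtype=float)
y = 2.0 * np.exp(0.4 * x) * rng.lognormal(0.0, 0.05, size=x.size)

# Correlation of x with y vs. with log(y): the transformed version is much more linear
print(np.corrcoef(x, y)[0, 1])           # weaker linear association
print(np.corrcoef(x, np.log(y))[0, 1])   # close to 1 after the log transform

# Fit a straight line to (x, log y)
b1, b0 = np.polyfit(x, np.log(y), 1)     # slope first, then intercept
print(b0, b1)                            # roughly log(2.0) and 0.4
```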

Lurking variables
A lurking variable is a variable not included in the study design that does have an effect on the variables studied. Lurking variables can falsely suggest a relationship. What is the lurking variable in these examples? How could you answer if you didn't know anything about the topic?
Strong positive association between the number of firefighters at a fire site and the amount of damage the fire does.
Negative association between moderate amounts of wine drinking and death rates from heart disease in developed nations.

Simpson's paradox (pp. 46-47 in Moore). Example: Which hospital is better? But what if we also know the patient's condition before the operation...
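Simpson's paradox can be made concrete with a small hypothetical table; the counts below are illustrative assumptions (not the figures from Moore). Hospital B looks better overall, yet Hospital A has the lower death rate within each pre-surgery condition group.

```python
# Hypothetical (deaths, patients) counts by pre-surgery condition
hospitals = {
    "Hospital A": {"good condition": (5, 500), "poor condition": (40, 1000)},
    "Hospital B": {"good condition": (10, 500), "poor condition": (10, 200)},
}

for name, groups in hospitals.items():
    deaths = sum(d for d, _ in groups.values())
    total = sum(n for _, n in groups.values())
    rates = ", ".join(f"{g}: {d / n:.1%}" for g, (d, n) in groups.items())
    print(f"{name}: overall {deaths / total:.1%} ({rates})")

# Hospital B's overall rate is lower, but Hospital A is better in both groups:
# the lurking variable (patient condition) reverses the comparison.
```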

Vocabulary: lurking vs. confounding
A lurking variable is a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.
Two variables are confounded when their effects on a response variable cannot be distinguished from each other. The confounded variables may be either explanatory variables or lurking variables.
But you often see the terms used interchangeably.

Association is not causation
An association between an explanatory variable x and a response variable y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y.
Example: There is a high positive correlation between the number of television sets per person (x) and the average life expectancy (y) for the world's nations. Could we lengthen the lives of people in Rwanda by shipping them TV sets?
The best way to get evidence that x causes y is to do an experiment in which we change x and keep lurking variables under control.

Caution before rushing into a correlation or regression analysis
Do not use regression on inappropriate data: watch for a pattern in the residuals, the presence of large outliers (use residual plots for help), or clumped data falsely appearing linear.
Beware of lurking variables.
Avoid extrapolating (going beyond interpolation).
A relationship, however strong, does not by itself imply causation.

Linear regression example (multiple)