Chapter 13: The Correlation Coefficient and the Regression Line


We begin with some useful facts about straight lines. Recall the x, y coordinate system, as pictured below.

[Figure: the x, y coordinate system with example lines, including y = 2.5 and y = 0.5x.]

We say that y is a linear function of x if y = a + bx, for some numbers a and b. If y = a + bx then the graph of the function is a straight line with y-intercept equal to a and slope equal to b. The line is horizontal if, and only if, b = 0; otherwise it slopes up if b > 0 and slopes down if b < 0. The only lines not covered by the above are the vertical lines, e.g. x = 6. Vertical lines are not interesting in Statistics.

In math class we learn that lines extend forever. In statistical applications, as we will see, they never extend forever. This distinction is very important. In fact, it would be more accurate to say that statisticians study line segments, not lines, but everybody says lines.

It will be very important for you to understand lines in two ways, what I call visually and analytically. Here is what I mean. Consider the line y = 5 + 2x. We will want to substitute (plug in) values for x to learn what we get for y. For example, x = 3. We do this analytically by substituting in the equation: y = 5 + 2(3) = 11. But we can also do this visually, by graphing the function. Walk along the x axis until we get to x = 3 and then climb up a rope (slide down a pole) until we hit the line. Our height when we hit the line is y = 11. (Draw picture on board.)

The Scatterplot

We are interested in situations in which we obtain two numbers per subject. For example, if the subjects are college students, the numbers could be:

X = height and Y = weight.
X = score on ACT and Y = first year GPA.
X = number of AP credits and Y = first year GPA.

Law schools are interested in: X = LSAT score and Y = first year law school GPA; and so on.

In each of these examples, the Y is considered more important by the researcher and is called the response. The X is important because its value might help us understand Y better, and it is called the predictor.
For some studies, reasonable people can disagree on which variable to call Y. Here are two examples:

The subjects are married couples and the variables are: wife's IQ and husband's IQ.
The subjects are identical twins and the variables are: first born's IQ and second born's IQ.

We study two big topics in Chapter 13. For the first of these, the correlation coefficient, it does not matter which variable is called Y.

For the second of these, the regression line, changing the assignment of Y and X will change the answer. Thus, if you are uncertain on the assignment, you might choose to do the regression line analysis twice, once for each assignment.

The material in Chapter 13 differs substantially from what we have done in this class. In Chapter 13, we impose fairly strict structure on how we view the data. This structure allows researchers to obtain very elaborate answers from a small amount of data. Perhaps surprisingly, these answers have a history of working very well in science. But it will be important to have a healthy skepticism about the answers we get and to examine the data carefully to decide whether the imposed structure seems reasonable.

We begin with an example with n = 124 subjects, a very large number of subjects for these problems. As we will see, often n is 10 or smaller. The subjects are 124 men who played major league baseball in both 1985 and 1986. This set contains every man who had at least 200 official at-bats in the American League in both years. The variables are: Y = 1986 Batting Average (BA) and X = 1985 BA. The idea is that, as a baseball executive, you might be interested in learning how effectively offensive performance one year (1985) can predict offensive performance the next year (1986).

In case you are not a baseball fan, here is all you need to know about this example.

BA is a measure of offensive performance, with larger values better.
BA is not really an average; it is a proportion: BA equals number of hits divided by number of official at-bats.
BA is always reported to three digits of precision and a BA of, say, 0.300 is referred to as "hitting 300."
BTW, 300 is the threshold for good hitting and 200 is the threshold for really bad hitting.

The names and data for the 124 men are on pp. 442-3. Behaving like the MITK, we first study the variables individually, following the ideas of Chapter 12.

[Figure: histograms of X and Y, each on an axis running from 0.180 to 0.360.]

These histograms suggest small and large outliers both years.
In addition, both histograms are close to symmetry and bell-shape. Also, the means and sd's changed little from X to Y.

Year   Mean     St.Dev.
1985   0.2664   0.0280
1986   0.2636   0.0320

Below is the scatterplot of these BA data. The first thing we look for are isolated cases (IC). I see two, possibly three, IC identified by initials below: WB, DM and FR.

[Figure: scatterplot of 1986 Batting Ave. versus 1985 Batting Ave., both axes running from 0.170 to 0.370, with the isolated cases DM, WB and FR labeled.]

Now, ignore the outliers and look for a pattern in the remaining data. For the BA data, the data describe an ellipse that is tilted upwards (lower to the left, higher to the right). This is an example of a linear relationship between X and Y; i.e. as X grows larger (sweep your eyes from left to right in the picture), the Y values tend to increase (become higher).

In Chapter 13, we limit attention to data sets that reveal a linear relationship between X and Y. If your data do not follow a linear relationship, you should not use the methods of Chapter 13. Thus, your analysis should always begin with a scatterplot to investigate whether a linear relationship is reasonable.

Page 447 of the text presents five hypothetical scatterplots: one reveals an increasing linear pattern; one reveals a decreasing linear pattern; and the remaining three show various curved relationships between X and Y. Thus, to reiterate: if your scatterplot is curved, do not use the methods of Chapter 13.

Page 448 of the text presents four scatterplots for data sets for small values of n (the n's are 9, 6, 12 and 13, typical sizes in practice). The subjects are spiders and the four scatterplots correspond to four categories of spiders. For each spider, Y is heart rate and X is weight. Above each scatterplot is the numerical value of r, the correlation coefficient of the data. At this time, it suffices to note that r > 0 indicates (reflects?) an increasing linear relationship and r < 0 indicates a decreasing linear relationship between Y and X.

There are two important ideas revealed by these scatterplots. First, for small n it can be difficult to decide whether a case is isolated; whenever possible, use your scientific knowledge to help with this decision. Second, especially for a small n, the presence of one or two isolated cases can drastically change our view of the data. For example, consider the n = 9 small hunters. The two spiders in the lower left of the scatterplot might be labeled isolated.
Including these cases, the text states that r > 0, but if they are deleted from the data set (which could be a deliberate action by the researcher, or perhaps these guys were stepped on during their commute to the lab) then r < 0. Scientists typically get very excited about whether r is positive or negative, so it is noteworthy that its sign can change so easily.

Thus far, we have been quite casual about looking at scatterplots. We say, "The pattern is linear and looks increasing (decreasing, flat)." It will remain (in this course) the job of our eyes and brain to decide on linearity, but the matter of increasing or decreasing will be usurped by the statisticians. Furthermore, using my eyes and brain, I can say that the pattern is decreasing for tarantulas and for web weavers (r agrees with me), and I can say that the linear pattern is stronger for the tarantulas. The correlation coefficient agrees with me on the issue of strength and has the further benefit of quantifying the notion of "stronger" in a manner that is useful to scientists.

I am not very good at motivating the formula for the correlation coefficient. In addition, the end of the semester is near, so time is limited. The interested student is referred to pp. 450-3 of the text for a (partial) explanation of the formula. Here is the briefest of presentations of the formula. Each subject has an x and a y. We standardize these values into $x'$ and $y'$:

$x' = (x - \bar{x})/s_X$; $y' = (y - \bar{y})/s_Y$.

We then form the product $z = x'y'$. The idea is that z > 0 provides evidence of an increasing relationship and z < 0 provides evidence of a decreasing relationship.

(The product is positive if both terms are positive or both are negative. Both positive means a large x is matched with a large y; both negative means a small x is matched with a small y.) The correlation coefficient combines the z's by almost computing their mean:

$r = \frac{\sum z}{n - 1}$.

The next slide presents 12 prototypes of the correspondence between a scatterplot and its correlation coefficient. These 12 scatterplots illustrate six important facts about correlation coefficients. These six facts appear on pages 454 and 456 of the text and will not be reprinted here.

[Figure: 12 prototype scatterplots, each labeled with its correlation coefficient; the magnitudes shown are 1.00, 0.80, 0.60, 0.40, 0.20 and 0.00.]

13.3: The regression line.

[Figure: scatterplot of Air Temp. versus Chirps per Minute with the line ẏ = 37.5 + 0.25x.]

[Figure: scatterplot of Air Temp. versus Chirps per Minute with the line ŷ = 56.2 + 0.136x.]

x       y      ẏ       y − ẏ    (y − ẏ)²
145.0   62.6   73.75   −11.15   124.32
172.0   81.5   80.50     1.00     1.00
155.0   77.9   76.25     1.65     2.72
137.0   84.2   71.75    12.45   155.00
179.5   92.8   82.37    10.43   108.68
192.0   86.9   85.50     1.40     1.96
207.0   87.8   89.25    −1.45     2.10
165.5   69.8   78.87    −9.07    82.36
193.0   71.6   85.75   −14.15   200.22
100.0   71.6   62.50     9.10    82.81
189.0   80.4   84.75    −4.35    18.92
SSE(ẏ) = 780.10

x       y      ŷ       y − ŷ    (y − ŷ)²
145.0   62.6   75.92   −13.32   177.42
172.0   81.5   79.59     1.91     3.64
155.0   77.9   77.28     0.62     0.38
137.0   84.2   74.83     9.37    87.76
179.5   92.8   80.61    12.19   148.55
192.0   86.9   82.31     4.59    21.05
207.0   87.8   84.35     3.45    11.89
165.5   69.8   78.71    −8.91    79.35
193.0   71.6   82.45   −10.85   117.68
100.0   71.6   69.80     1.80     3.24
189.0   80.4   81.90    −1.50     2.26
SSE(ŷ) = 653.23

On n = 11 occasions, Susan Robords determined two values for different crickets: Y is the air temperature and X is the cricket's chirp rate in chirps per minute. In her campcraft class, she was told that one can calculate the air temperature with the following equation: ẏ = 37.5 + 0.25x. Above, we have a scatterplot of Susan's data with this line. The most obvious fact is that "calculate" was way too optimistic!
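The correlation formula given before the cricket example can be checked numerically. The following is a minimal Python sketch, not part of the original slides, that standardizes the 11 cricket observations from the tables above, forms the products z, and divides their sum by n − 1. The function names `mean`, `sample_sd` and `correlation` are assumptions made for illustration, and the standard deviations are assumed to use the n − 1 divisor.

```python
import math

# Cricket data from the tables above: x = chirps per minute, y = air temperature.
x = [145.0, 172.0, 155.0, 137.0, 179.5, 192.0, 207.0, 165.5, 193.0, 100.0, 189.0]
y = [62.6, 81.5, 77.9, 84.2, 92.8, 86.9, 87.8, 69.8, 71.6, 71.6, 80.4]


def mean(values):
    """Arithmetic mean."""
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (assumed divisor: n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """Sum of the products of standardized values, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    z = [((a - x_bar) / s_x) * ((b - y_bar) / s_y) for a, b in zip(xs, ys)]
    return sum(z) / (len(xs) - 1)


r = correlation(x, y)
print(f"r = {r:.3f}")
```

Swapping the arguments, `correlation(y, x)`, should give the same value, which matches the earlier remark that for the correlation coefficient it does not matter which variable is called Y.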

As Yogi Berra once said, "You can observe a lot by just watching." Let's follow his advice and examine the scatterplot and table above. We see that on some occasions, ẏ provides an accurate prediction of y. Visually, this is represented by circles that are on, touching, or nearly touching the line. But on many other occasions, the predictions are poor: the line is either far lower than the circle (the prediction is too small) or the line is far higher than the circle (the prediction is too large).

Next, we do something very strange. We change perspective and instead of saying that the prediction is too small (large) we say that the observation is too large (small). Egocentric? Yes, but there are two reasons. First, look at the scatterplot and line again. It is easier to focus on the line and see how the points deviate from it, than it is to focus on all the points (n could be large) and see how the line deviates. Second, we plan to compare ẏ and y by subtraction. We could use ẏ − y or y − ẏ. The former takes y as the standard and the latter reverses the roles. For circles below the line, I want this error to be a negative number; to get that I must subtract in the order y − ẏ; that is, I take the prediction as the standard and the observation errs by not agreeing.

Look at the table again. The ideal for the error y − ẏ is 0. As the error moves away from 0, in either direction, the inadequacy of the prediction becomes more and more serious. For math reasons (and often it makes sense scientifically; at least approximately) we consider an error of, say, −5 to be exactly as serious as an error of +5. As in Chapter 12, we might be tempted to achieve this by taking the absolute value of each error, but, again, we get much better math results by squaring the errors. Finally, we sum all of the squared errors to obtain: SSE(ẏ) = 780.10.

Ideally, SSE = 0 and the larger it is, the worse the prediction. You are probably thinking that we need to adjust SSE to account for sample size, but we won't bother with that. Instead, we pose the following question: Can we improve on Susan's line?
Or: Can we find another prediction line which has an SSE that is smaller than Susan's 780.10? I suggest the line ŷ = 56.2 + 0.136x. From the table, we see that SSE(ŷ) = 653.23. Thus, according to The Principle of Least Squares, ŷ is superior to ẏ. Can we do better than my ŷ? No.

Major Result: There is always a unique line that minimizes SSE over all possible lines. The equation of the line is given as $\hat{y} = b_0 + b_1 x$, where

$b_1 = r(s_Y/s_X)$ and $b_0 = \bar{y} - b_1\bar{x}$.

For the cricket data, for example, it can be shown that $\bar{x} = 166.8$, $s_X = 31.0$, $\bar{y} = 78.83$, $s_Y = 9.11$, and $r = 0.461$. Substituting these values into the above yields

$b_1 = 0.461(9.11/31.0) = 0.1355$, and $b_0 = 78.83 - 0.1355(166.8) = 56.23$.

Thus, the equation of the best prediction line is ŷ = 56.23 + 0.1355x, which I rounded in my earlier presentation of it.

The means and sd's of the BA data were given earlier (slide 343) and it has r = 0.554. Thus,

$b_1 = 0.554(0.032/0.028) = 0.633$, and $b_0 = 0.2636 - 0.633(0.2664) = 0.095$.

Thus, the equation of the regression line is ŷ = 0.095 + 0.633x. This line appears on page 471 of the text.
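The comparison of the two cricket lines, and the formulas in the Major Result, can be sketched in a few lines of Python. This is an illustrative check rather than anything from the original slides. The helper names `sse` and `least_squares` are assumptions, and the slope is computed as S_xy / S_xx, which is algebraically the same as r(s_Y/s_X).

```python
# Cricket data from the tables above: x = chirps per minute, y = air temperature.
x = [145.0, 172.0, 155.0, 137.0, 179.5, 192.0, 207.0, 165.5, 193.0, 100.0, 189.0]
y = [62.6, 81.5, 77.9, 84.2, 92.8, 86.9, 87.8, 69.8, 71.6, 71.6, 80.4]


def sse(intercept, slope, xs, ys):
    """Sum of squared errors (y - prediction)^2 for the line intercept + slope * x."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(xs, ys))


def least_squares(xs, ys):
    """Return (b0, b1) for the line that minimizes SSE."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    s_xx = sum((a - x_bar) ** 2 for a in xs)
    b1 = s_xy / s_xx           # same value as r * (s_Y / s_X)
    b0 = y_bar - b1 * x_bar    # the line passes through (x_bar, y_bar)
    return b0, b1


sse_susan = sse(37.5, 0.25, x, y)      # the campcraft line
sse_rounded = sse(56.2, 0.136, x, y)   # the rounded line suggested in the text
b0, b1 = least_squares(x, y)
sse_best = sse(b0, b1, x, y)           # unrounded least-squares line

print(f"SSE, campcraft line:     {sse_susan:.2f}")
print(f"SSE, rounded line:       {sse_rounded:.2f}")
print(f"least squares: b0 = {b0:.2f}, b1 = {b1:.4f}")
print(f"SSE, least-squares line: {sse_best:.2f}")
```

Because the line in the text uses rounded coefficients, its SSE should come out very slightly above that of the unrounded least-squares line, and both should be well below the SSE of the campcraft line.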

We have seen that it is easy to calculate ŷ and it is the best line possible (based on the principle of least squares), but is it any good? (Is Sylvester Stallone's best performance any good? Is there a reason he has never done Shakespeare?)

First, note that we can see why r is so important. We need five numbers to calculate ŷ: two numbers that tell us about x only; two numbers that tell us about y only; and one number (r) that tells us how x and y relate to each other. In other words, r tells us all we need to know about the association between x and y.

We obtain the regression line by calculating two numbers: b_0 and b_1. Thus, obviously, this pair of numbers is important. Also, b_1, the slope, is important by itself; it tells us how a change in x affects our prediction ŷ. Unlike mathematics, however, the intercept, b_0, alone usually is not of interest.

Now in math, the intercept is interpreted as the value of y when x = 0. Consider our examples. For the cricket study, x = 0 gives us ŷ = 56.2. But we have no data at or near x = 0; thus, we really don't know what it means for x to equal 0. (Discuss.) Similarly, for the BA study, x = 0 predicts a 1986 BA of 0.095. But nobody batted at or near 0.000 in 1985. In fact, I conjecture that in the history of baseball there has never been a position player with at least 200 at-bats who batted 0.000.

Consider the following scatterplot of fish activity versus water temperature for fish in an aquarium. (Should we use these data to predict fish activity for x = 32? For x = 212?)

[Figure: scatterplot of Fish Activity (axis from 300 to 500) versus Water Temp. (F) (axis from 70 to 80).]

The above considerations have resulted in some statisticians advocating a second way to write the equation for ŷ:

$\hat{y} = \bar{y} + b_1(x - \bar{x})$.

For the cricket study:

$\hat{y} = 78.83 + 0.461(9.11/31.0)(x - 166.8) = 78.83 + 0.1355(x - 166.8)$.

This second formula contains three numbers and they all have meaning: the mean of the predictor; the mean of the response; and the slope. For better or worse, this formulation has not become popular and you are not responsible for it on the final.
It does, however, give us an easy proof of one of the most important features of the regression line, something I like to call: The law of preservation of mediocrity!

Suppose that a subject is mediocre on x; that is, the subject's $x = \bar{x}$. What is the predicted response for this subject? Plugging $x = \bar{x}$ into $\hat{y} = \bar{y} + b_1(x - \bar{x})$, we get

$\hat{y} = \bar{y} + b_1(\bar{x} - \bar{x}) = \bar{y} + b_1(0) = \bar{y}$.
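The same substitution can be tried numerically. The sketch below is an illustration under stated assumptions, not the author's code: the function name `predict_centered` and the dictionary layout are hypothetical, and the inputs are the rounded summary statistics quoted earlier for the cricket and BA studies.

```python
def predict_centered(x, x_bar, y_bar, r, s_x, s_y):
    """Prediction from the centered form: y_bar + b1 * (x - x_bar)."""
    b1 = r * (s_y / s_x)  # slope, as in the Major Result
    return y_bar + b1 * (x - x_bar)


# Rounded summary statistics quoted in the text.
cricket = dict(x_bar=166.8, y_bar=78.83, r=0.461, s_x=31.0, s_y=9.11)
batting = dict(x_bar=0.2664, y_bar=0.2636, r=0.554, s_x=0.028, s_y=0.032)

# A subject who is exactly average on x is predicted to be exactly average on y.
cricket_at_mean = predict_centered(cricket["x_bar"], **cricket)
batting_at_mean = predict_centered(batting["x_bar"], **batting)
print(cricket_at_mean, batting_at_mean)

# For comparison, a hypothetical player who hit 0.300 in 1985.
batting_at_300 = predict_centered(0.300, **batting)
print(round(batting_at_300, 3))
```

When x equals the mean of the predictor, the term b_1(x − x̄) is exactly zero, so the prediction is the mean of the response whatever the slope is.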