SAMPLE. Investigating the relationship between two numerical variables. Objectives

Similar documents
SAMPLE 4CORE. Displaying and describing relationships between two variables. 4.1 Investigating the relationship between two categorical variables

Summarising numerical data

Name Class Date. Residuals and Linear Regression Going Deeper

Year 10 Mathematics Semester 2 Bivariate Data Chapter 13

Chapter 6: Exploring Data: Relationships Lesson Plan

Linear Regression Communication, skills, and understanding Calculator Use

Scatterplots and Correlation

Chapter 3: Examining Relationships

Chapter 12 Summarizing Bivariate Data Linear Regression and Correlation

Chapter 12: Linear Regression and Correlation

5.1 Bivariate Relationships

Example: Can an increase in non-exercise activity (e.g. fidgeting) help people gain less weight?

AP Statistics Two-Variable Data Analysis

YEAR 10 GENERAL MATHEMATICS 2017 STRAND: BIVARIATE DATA PART II CHAPTER 12 RESIDUAL ANALYSIS, LINEARITY AND TIME SERIES

Describing Bivariate Relationships

CHAPTER 5 LINEAR REGRESSION AND CORRELATION

Chapter 14. Statistical versus Deterministic Relationships. Distance versus Speed. Describing Relationships: Scatterplots and Correlation

Data transformation. Core: Data analysis. Chapter 5

Important note: Transcripts are not substitutes for textbook assignments. 1

Chapter 12 : Linear Correlation and Linear Regression

MATH 1070 Introductory Statistics Lecture notes Relationships: Correlation and Simple Regression

SAMPLE. Describing the distribution of a single variable

3.2: Least Squares Regressions

Lecture 48 Sections Mon, Nov 16, 2009

BIOSTATISTICS NURS 3324

Overview. 4.1 Tables and Graphs for the Relationship Between Two Variables. 4.2 Introduction to Correlation. 4.3 Introduction to Regression 3.

Business Statistics. Lecture 10: Correlation and Linear Regression

Chapter 6. Exploring Data: Relationships

MATH 1150 Chapter 2 Notation and Terminology

The following formulas related to this topic are provided on the formula sheet:

Quantitative Bivariate Data

Module 1 Linear Regression

AP Statistics Bivariate Data Analysis Test Review. Multiple-Choice

BIVARIATE DATA data for two variables

y n 1 ( x i x )( y y i n 1 i y 2

Copyright, Nick E. Nolfi MPM1D9 Unit 6 Statistics (Data Analysis) STA-1

Objectives. 2.3 Least-squares regression. Regression lines. Prediction and Extrapolation. Correlation and r 2. Transforming relationships

Chapter 3: Examining Relationships

Prob/Stats Questions? /32

1 A Review of Correlation and Regression

AP Statistics. Chapter 6 Scatterplots, Association, and Correlation

Bivariate data data from two variables e.g. Maths test results and English test results. Interpolate estimate a value between two known values.

Chapter 8. Linear Regression /71

Topic 10 - Linear Regression

Chapter 7. Scatterplots, Association, and Correlation

a. Yes, it is consistent. a. Positive c. Near Zero

Least Squares Regression

Scatterplots. 3.1: Scatterplots & Correlation. Scatterplots. Explanatory & Response Variables. Section 3.1 Scatterplots and Correlation

1) A residual plot: A)

AP Statistics Unit 2 (Chapters 7-10) Warm-Ups: Part 1

Chapter 9. Correlation and Regression

Chapter 3: Examining Relationships

6.1.1 How can I make predictions?

(c) Plot the point ( x, y ) on your scatter diagram and label this point M. (d) Write down the product-moment correlation coefficient, r.

Bivariate Data Summary

SECTION I Number of Questions 42 Percent of Total Grade 50

Mrs. Poyner/Mr. Page Chapter 3 page 1

CRP 272 Introduction To Regression Analysis

4.1 Introduction. 4.2 The Scatter Diagram. Chapter 4 Linear Correlation and Regression Analysis

11 Correlation and Regression

Correlation & Regression. Dr. Moataza Mahmoud Abdel Wahab Lecturer of Biostatistics High Institute of Public Health University of Alexandria

Examining Relationships. Chapter 3

Introductory Statistics

NAME: DATE: SECTION: MRS. KEINATH

MFM1P Foundations of Mathematics Unit 3 Lesson 11

Final Exam - Solutions

Steps to take to do the descriptive part of regression analysis:

Related Example on Page(s) R , 148 R , 148 R , 156, 157 R3.1, R3.2. Activity on 152, , 190.

Chapter Goals. To understand the methods for displaying and describing relationship among variables. Formulate Theories.

Math 52 Linear Regression Instructions TI-83

AP STATISTICS Name: Period: Review Unit IV Scatterplots & Regressions

CORRELATION ANALYSIS. Dr. Anulawathie Menike Dept. of Economics

Lecture 4 Scatterplots, Association, and Correlation

Lecture 4 Scatterplots, Association, and Correlation

The Simple Linear Regression Model

Sampling, Frequency Distributions, and Graphs (12.1)

Business Statistics. Lecture 9: Simple Regression

UCLA STAT 10 Statistical Reasoning - Midterm Review Solutions Observational Studies, Designed Experiments & Surveys

Looking at Data Relationships. 2.1 Scatterplots W. H. Freeman and Company

date: math analysis 2 chapter 18: curve fitting and models

MAL SHIELD. Student Textbook

Objectives. 2.1 Scatterplots. Scatterplots Explanatory and response variables Interpreting scatterplots Outliers

Hit the brakes, Charles!

Outline. Lesson 3: Linear Functions. Objectives:

Stat 101 Exam 1 Important Formulas and Concepts 1

Chapter 2.1 Relations and Functions

Intermediate Algebra Summary - Part I

AP Statistics Unit 6 Note Packet Linear Regression. Scatterplots and Correlation

Linear Regression 3.2

Regression Analysis. BUS 735: Business Decision Making and Research

Key Concepts. Correlation (Pearson & Spearman) & Linear Regression. Assumptions. Correlation parametric & non-para. Correlation

Topic 2 Part 1 [195 marks]

appstats8.notebook October 11, 2016

Using a graphic display calculator

Relationships Regression

Solutionbank S1 Edexcel AS and A Level Modular Mathematics

Further Mathematics 2018 CORE: Data analysis Chapter 2 Summarising numerical data

OCR Maths S1. Topic Questions from Papers. Bivariate Data

CORRELATION AND REGRESSION

8.1 Frequency Distribution, Frequency Polygon, Histogram page 326

Transcription:

C H A P T E R 23 Investigating the relationship between two numerical variables Objectives To use scatterplots to display bivariate (numerical) data To identify patterns and features of sets of data from scatterplots To identify positive, negative or no association between variables from a scatterplot To introduce the q-correlation coefficient to measure the strength of the relationship between two variables To introduce Pearson s product-moment correlation coefficient r to measure the strength of the linear relationship between two variables To fit a straight line to data by eye, and using the method of least squares To interpret the slope of a regression line and its intercept, if appropriate To predict the value of the dependent (response) variable from an independent (explanatory) variable, using a linear equation In Chapter 22 statistics of one variable were discussed. Sometimes values of a variable for more than one group have been examined, such as age of mothers and age of fathers, but only one variable was considered for each individual at a time. When two variables are observed for each subject, bivariate data are obtained. For example, it might be interesting to record the number of hours spent studying for an exam by each student in a class and the mark they achieved in the exam. If each of these variables were considered separately the methods discussed earlier would be used. It may be of more interest to examine the relationship between the two variables, in which case new bivariate techniques are required. When exploring bivariate data, questions arise such as, Is there a relationship between two variables? or Does knowing the value of one of the variables tell us anything about the value of the other variable? 554

Chapter 23 Investigating the relationship between two numerical variables 555 Consider the relationship between the number of cigarettes smoked per day and blood pressure. Since one opinion might be that varying the number of cigarettes smoked may affect blood pressure, it is necessary to distinguish between blood pressure, which is called the dependent or response variable, and the number of cigarettes, which is called the independent or explanatory variable. In this chapter some techniques are introduced which enable questions concerning the nature of the relationship between such variables to be answered. 23.1 Displaying bivariate data As with data concerning one variable, the most important first step in analysing bivariate data is the construction of a visual display. When both of the variables of interest are numerical then a scatterplot (or bivariate plot) may be constructed. This is the single most important tool in the analysis of such bivariate data, and should always be examined before further analysis is undertaken. The pairs of data points are plotted on the cartesian plane, with each pair contributing one point to the plot. Using the normal convention, the variable plotted horizontally is denoted as x, and the variable plotted vertically as y. The following example examines the features of the scatterplot in more detail. Example 1 The number of hours spent studying for an examination by each member of a class, and the marks they were awarded, are given in the table. Student 1 2 3 4 5 6 7 8 9 10 Hours 4 36 23 28 25 11 18 13 4 8 Mark 27 87 67 84 66 52 61 43 38 52 Student 11 12 13 14 15 16 17 18 19 20 Hours 4 19 6 19 1 29 33 36 28 15 Mark 41 54 57 62 23 65 75 83 65 55 Construct a scatterplot of these data. Solution The first decision to be made when preparing this scatterplot is whether to show Mark or Hours on the horizontal (x) axis. Since a student s mark is likely to depend on the hours that they spend studying, in this case Hours is the independent variable and Mark is the dependent variable. By convention, the independent variable is plotted on the horizontal (x) axis, and the dependent variable on the vertical (y) axis, giving the scatterplot shown.

556 Essential Advanced General Mathematics Mark y 80 60 40 20 0 10 20 30 40 Hours x From this scatterplot, a general trend can be seen of increasing marks with increasing hours of study. There is said to be a positive association between the variables. Twovariables are positively associated when larger values of y are associated with larger values of x,asshown in the previous scatterplot. Examples of variables which exhibit positive association are height and weight, foot size and hand size, and number of people in the family and household expenditure on food. Example 2 The age, in years, of several cars and their advertised price in a newspaper are given in the following table. Age (years) 4 6 5 7 4 2 3 3 Price ($) 13 000 9 800 11 000 8 300 10 500 15 800 14 300 13 800 Age (years) 7 6 4 6 4 8 6 Price ($) 9 700 9 500 13 200 10 000 11 800 8 000 12 200 Construct a scatterplot to display these data. Solution In this case the independent variable is the age of the car, which is plotted on the horizontal axis. The dependent variable, price, is plotted on the vertical axis. Price ($) y 16000 14000 12000 10000 8000 1 2 3 4 5 6 7 8 Age (years) x

Chapter 23 Investigating the relationship between two numerical variables 557 From the scatterplot a general trend of decreasing price with increasing age of car can be seen. There is said to be a negative association between the variables. Twovariables are negatively associated when larger values of y are associated with smaller values of x,asshown in the scatterplot above. Examples of other variables which exhibit negative association are weight and number of weeks spent on a diet program, hearing ability and age, and number of cold rainy days per week and sales of ice creams. The third alternative is that a scatterplot shows no particular pattern, indicating no association between the variables. y 8 6 4 2 0 1 2 3 4 5 6 7 8 x There is no association between two variables when the values of y are not related to the values of x,asshown in the preceding scatterplot. Examples of variables which show no association are height and IQ for adults, price of cars and fuel consumption, and size of family and number of pets. When one point, or a few points, do not seem to fit with the rest of the data they are called outliers. Sometimes a point is an outlier, not because its x value or its y value is in itself unusual, but rather because this particular combination of values is atypical. Consequently such an outlier cannot always be detected from single variable displays, such as stem-and-leaf plots. Forexample, consider this scatterplot. While the y variable plotted on the horizontal axis takes values 8 from 1 to 8 and the variable plotted on the vertical 6 axis takes values from 2 to 8, the combination (2, 8) 4 is clearly an outlier. 2 0 1 2 3 4 5 6 7 8 x

558 Essential Advanced General Mathematics Using the TI-Nspire The calculator can be used to construct a scatterplot of statistical data. The procedure is illustrated using the age and price of car data from Example 2. The data is easiest entered in a Lists & Spreadsheet application ( 3). Firstly, use the up/down arrows ( )to name the first column age and the second column price. Then enter the data as shown. Open a Data & Statistics application ( 5) tograph the data. At first the data displays as shown. Specify the x variable by selecting Add X Variable from the Plot Properties (b 2 4) and selecting age. Specify the y variable by selecting Add Y Variable from the Plot Properties (b 2 6) and selecting price. The data now displays as shown.

Chapter 23 Investigating the relationship between two numerical variables 559 Using the Casio ClassPad The table represents the results of 12 students in two tests. Test 1 score 10 18 13 6 8 5 12 15 15 Test 1 score 12 20 11 9 6 6 12 13 17 Enter the data into list1 (x) and list2 (y) inthe Setting...and select the tab for Graph3. (Note: Following on from the types of graphs in univariate statistics, this allows the Scatterplot settings to be remembered and called upon when required.) Ensure that all other graphs are de-selected and tap to produce the graph shown in the full screen. module. Tap SetGraph, Select the graph window (bold border) and tap Analysis, Trace to scroll from point to point and display the coordinates at the bottom of the graph. Exercise 23A Note: Save your data for 1 4 in named lists as they will be needed for later exercises. 1 The amount of a particular pain relief drug given to each patient and the time taken for the patient to experience pain relief are shown. a b c Patient 1 2 3 4 5 6 7 8 9 10 Drug dose (mg) 0.5 1.2 4.0 5.3 2.6 3.7 5.1 1.7 0.3 4.0 Response time (min) 65 35 15 10 22 16 10 18 70 20 Plot the response time against drug dose. From the scatterplot, describe any association between the two variables. Identify outliers, if any, and interpret. 2 The proprietor of a hairdressing salon recorded the amount spent advertising in the local paper and the business income for each month of a year, with the following results.

560 Essential Advanced General Mathematics Month Advertising ($) Business ($) 1 350 9 450 2 450 10 070 3 400 9 380 4 500 9 110 5 250 5 220 6 150 3 100 a b c Month Advertising ($) Business ($) 7 350 8 060 8 300 7 030 9 550 11 500 10 600 12 870 11 550 10 560 12 450 9 850 Plot the business income against the advertising expenditure. From the scatterplot, describe any association between the two variables. Identify outliers, if any, and interpret. 3 The number of passenger seats on the most commonly used commercial aircraft, and the airspeeds of these aircraft, in km/h, are shown in the following table. a b c Number of seats 405 296 288 258 240 230 193 188 Airspeed (km/h) 830 797 774 736 757 765 760 718 Number of seats 148 142 131 122 115 112 103 102 Airspeed (km/h) 683 666 661 378 605 620 576 603 Plot the airspeed against the number of seats. From the scatterplot, describe any association between the two variables. Identify outliers, if any, and interpret. 4 The price and age of several secondhand caravans are listed in the table. a b c Age (years) Price ($) 7 4 800 7 3 900 8 4 275 9 3 900 4 6 900 8 6 500 1 11 400 Age (years) Price ($) 10 8 700 9 1 950 9 3 300 11 1 650 3 9 600 4 8 400 7 6 600 Plot the price of the caravans against their age. From the scatterplot, describe any association between the two variables. Identify outliers, if any, and interpret. 23.2 The q-correlation coefficient If the plot of a bivariate data set shows a basic trend, apart from some randomness, then it is useful to provide a numerical measure of the strength of the relationship between the two variables. Correlation is a measure of strength of a relationship which applies only to

Chapter 23 Investigating the relationship between two numerical variables 561 numerical variables. Thus it is sensible, for example, to calculate the correlation between the heights and weights for a group of students, but not between height and gender, as gender is not a numerical variable. There are many different numerical measures of correlation, and each has different properties. In this section the q-correlation coefficient will be introduced. Consider the scatterplot of the number of hours spent by each member of a class when studying for an examination, and the mark they were awarded, from Example 1. This shows a positive association. To calculate the q-correlation coefficient, first find the median value for each of the variables separately. This can be done from the data, but it is usually simpler to calculate directly from the plot. There are 20 data points, and the median values are halfway between the 10th and 11th points, both vertically and horizontally. A vertical line is then drawn through the median x value, and a horizontal line through the median y value. The effect of this is to divide the plot into four regions, as shown. Marks y 80 60 40 20 0 10 20 30 40 Hours x Each of the four regions which have been created in this way is called a quadrant, and it can be noticed immediately that most of the points in this plot are in the upper right and lower left quadrants. In fact, wherever there is a positive association between variables this will be the case. Consider the scatterplot of the age of cars and the advertised price from Example 2, which shows negative association. Again the median value for each of the variables is found separately. There are 15 data points, giving the median values at the 8th points, both vertically and horizontally. A vertical line is then drawn through the median x value, and a horizontal line through the median y value. In this particular case they are coordinates of the same point, but this need not be so. Price ($) y 16000 14000 12000 10000 8000 1 2 3 4 5 6 7 8 Age (years) x

562 Essential Advanced General Mathematics It can be seen in this example that most points are in the upper left and the lower right quadrants, and this is true whenever there is a negative association between variables. These observations lead to a definition of the q-correlation coefficient. The q-correlation coefficient can be determined from the scatterplot as follows. Find the median of all the x values in the data set, and draw a vertical line through this value. Find the median of all the y values in the data set, and draw a horizontal line through this value. y 0 The plane is now divided into four quadrants. Label the quadrants A, B, C and D as shown in the diagram. Count the number of points in each of the quadrants A, B, C and D. Any points which lie on the median lines are omitted. Let a, b, c, d represent the number of points in each of the quadrants A, B, C and D respectively. Then the q-correlation coefficient is given by (a + c) (b + d) q = a + b + c + d Example 3 Use the scatterplot from Example 1 to determine the q-correlation coefficient for the number of hours each member of a class spent studying for an examination and the mark they were awarded. Solution There are nine points in quadrant A, one in quadrant B, nine in quadrant C and one in quadrant D. (a + c) (b + d) Thus q = a + b + c + d (9 + 9) (1 + 1) = 9 + 1 + 9 + 1 = 18 2 20 = 16 20 = 0.8 B C A D x

Chapter 23 Investigating the relationship between two numerical variables 563 Example 4 Use the scatterplot from Example 2 to determine the q-correlation coefficient for the age of cars and their advertised price. Solution There is one point in quadrant A, six in quadrant B, one in quadrant C and six in quadrant D. (a + c) (b + d) q = a + b + c + d (1 + 1) (6 + 6) = 1 + 6 + 1 + 6 = 2 12 14 = 10 14 = 0.71 From Examples 3 and 4 it can be seen that q-correlation coefficients may take both positive and negative values. Consider the situation when all the points are in the quadrants A and C. (a + c) (b + d) q = a + b + c + d = a + c a + c = 1 (since b and d are both equal to zero) Thus the maximum value the q-correlation coefficient may take is 1, and this indicates a measure of strong positive association. Suppose all the points are in the quadrants B and D. (a + c) (b + d) q = a + b + c + d (b + d) = b + d = 1 (since a and c are both equal to zero) Thus the minimum value the q-correlation coefficient may take is 1, and this indicates a measure of strong negative association.

564 Essential Advanced General Mathematics When the same number of points are in each of the quadrants A, B, C and D then: (a + c) (b + d) q = a + b + c + d 0 = a + b + c + d (since a = b = c = d) = 0 This value of the q-correlation coefficient clearly indicates that no association exists. q-correlation coefficients can be classified as follows: 1 q 0.75 strong negative relationship 0.75 < q 0.50 moderate negative relationship 0.50 < q 0.25 weak negative relationship 0.25 < q < 0.25 no relationship 0.25 q < 0.50 weak positive relationship 0.50 q < 0.75 moderate positive relationship 0.75 q 1 strong positive relationship Exercise 23B 1 Use the table of q-correlation coefficients to classify the following. a q = 0.20 b q = 0.30 c q = 0.85 d q = 0.33 e q = 0.95 f q = 0.75 g q = 0.75 h q = 0.24 i q = 1 j q = 0.25 k q = 1 l q = 0.50 2 Calculate the q-correlation coefficient for each pair of variables shown in the following scatterplots. a y 36 24 12 15.0 20.0 25.0 30.0 35.0 x

Chapter 23 Investigating the relationship between two numerical variables 565 b y Example 4 c d 280 210 140 y 8 6 4 2 0 y 80 60 40 20 0 120 160 200 240 280 1 2 3 4 5 6 7 8 10 20 30 3 The amount of a particular pain relief drug given to each patient and the time taken for the patient to experience pain relief are shown. Patient 1 2 3 4 5 6 7 8 9 10 Drug dose (mg) 0.5 1.2 4.0 5.3 2.6 3.7 5.1 1.7 0.3 4.0 Response time (min) 65 35 15 10 22 16 10 18 70 20 40 x x x a Use your scatterplot from 1,Exercise 23A to find the q-correlation coefficient for response time and drug dosage.

566 Essential Advanced General Mathematics Example 3 b Classify the strength and direction of the relationship between response time and drug dosage according to the table given. 4 The proprietor of a hairdressing salon recorded the amount spent advertising in the local paper and the business income for each month of a year, with the following results. Month Advertising ($) Business ($) 1 350 9 450 2 450 10 070 3 400 9 380 4 500 9 110 5 250 5 220 6 150 3 100 a b Month Advertising ($) Business ($) 7 350 8 060 8 300 7 030 9 550 11 500 10 600 12 870 11 550 10 560 12 450 9 850 Use your scatterplot from 2,Exercise 23A to find the q-correlation coefficient for advertising expenditure and total business conducted. Classify the strength and direction of the relationship between advertising expenditure and business income according to the table given. 5 The number of passenger seats on the most commonly used commercial aircraft, and the airspeeds of these aircraft, in km/h, are shown in the following table. a b Number of seats 405 296 288 258 240 230 193 188 Airspeed (km/h) 830 797 774 736 757 765 760 718 Number of seats 148 142 131 122 115 112 103 102 Airspeed (km/h) 683 666 661 378 605 620 576 603 Use your scatterplot from 3,Exercise 23A to find the q-correlation coefficient for the number of seats on an airline and the air speed. Classify the strength and direction of the relationship between the number of seats on an airline and the air speed according to the table given. 6 The price and age of several secondhand caravans are listed in the table. Age (years) Price ($) 7 4 800 7 3 900 8 4 275 9 3 900 4 6 900 8 6 500 1 11 400 Age (years) Price ($) 10 8 700 9 1 950 9 3 300 11 1 650 3 9 600 4 8 400 7 6 600

a b Chapter 23 Investigating the relationship between two numerical variables 567 Use your scatterplot from 4,Exercise 23A to find the q-correlation coefficient for price and age of secondhand caravans. Classify the strength and direction of the relationship between price and age of secondhand caravans according to the table given. 23.3 The correlation coefficient When a relationship is linear the most commonly used measure of strength of the relationship is Pearson s product-moment correlation coefficient, r. Itgives a numerical measure of the degree to which the points in the scatterplot tend to cluster around a straight line. Pearson s product-moment correlation is defined to be degree which the variables vary together r = degree which the two variables vary separately Formally, if we call the two variables x and y and we have n observations then Pearson s product-moment correlation for this set of observations is r = 1 n ( )( ) xi x yi ȳ n 1 i=1 s x s y where x and s x are the mean and standard deviation of the x scores and ȳ and s y are the mean and standard deviation of the y scores. There are two key assumptions made in calculating Pearson s correlation coefficient r. They are the data is numerical the relationship being described is linear. Pearson s correlation coefficient r has the following properties: If there is no linear relationship, r = 0. y 0 r = 0 For a perfect positive linear relationship, r =+1. y x For a perfect negative linear relationship, r = 1. y x x 0 r = +1 0 r = 1

Verbal Age Calf 568 Essential Advanced General Mathematics Otherwise, 1 r +1 Pearson s correlation coefficient r can be classified as follows: 1 r 0.75 strong negative linear relationship 0.75 r 0.50 moderate negative linear relationship 0.50 r 0.25 weak negative linear relationship 0.25 < r < 0.25 no linear relationship 0.25 r < 0.50 weak positive linear relationship 0.50 r < 0.75 moderate positive linear relationship 0.75 r 1 strong positive linear relationship The following scatterplots show linear relationships of various strengths together with the corresponding value of Pearson s product-moment correlation coefficient. CO level 14 12 10 100 150 200 250 300 350 400 Traffic volume Carbon monoxide level in the atmosphere and traffic volume: r = 0.985 Mortality 160 140 120 100 80 60 80 100 120 Smoking ratio Mortality rate due to lung cancer and smoking ratio (100 average): r = 0.716 700 600 500 400 500 600 700 800 Mathematics 30 25 20 15 600 800 1000 1200 1400 Testosterone level Age first convicted and testosterone (a male hormone) level of a group of prisoners: r = 0.814 Score 110 100 90 8 10 12 14 16 18 20 Age 1st word Score on aptitude test (taken later in life) and age (in months) when first word spoken: r = 0.445 Scores on standardised tests of verbal and mathematical ability: r = 0.275 20 15 10 5 30 40 50 Age Calf measurement and age of adult males: r = 0.005

Chapter 23 Investigating the relationship between two numerical variables 569 Using the TI-Nspire The value Pearson s product-moment correlation coefficient, r, can be calculated using the calculator. This will be illustrated using the age and price of car data from Example 2. With the data entered as the two lists age and price respectively, we now open a Calculator application ( 1)tocalculate Pearson s product-moment correlation coefficient. Use Linear Regression (mx + b) from the Stat Calculations submenu of the Statistics menu (b 613) and complete the dialog box as shown. Press enter to obtain the regression information including the value for r as shown. Using the Casio ClassPad Consider the following set of data x 1 3 5 4 7 y 2 5 7 2 9 Enter the data into list1 (x) and list2 (y). Tap Calc, Linear Reg and select the settings shown. Tap OK to produce the results shown in the second screen. The value of r is shown in the answer box. TapOKtoproduce a scatterplot.

570 Essential Advanced General Mathematics After the mean and standard deviation, Pearson s product-moment correlation is one of the most frequently computed descriptive statistics. It is a powerful tool but it is also easily misused. The presence of a linear relationship should always be confirmed with a scatterplot before Pearson s product-moment correlation is calculated. And, like the mean and the standard deviation, Pearson s correlation coefficient r is very sensitive to the presence of outliers in the sample. Exercise 23C 1 Use the table of Pearson s correlation coefficients r to classify the following. a r = 0.20 b r = 0.30 c r = 0.85 d r = 0.33 e r = 0.95 f r = 0.75 g r = 0.75 h r = 0.24 i r = 0.50 j r = 0.25 k r = 1 l r = 1 2 By comparing the plots given to those on page 538 estimate the value of Pearson s correlation coefficient r. a y b y c e 36 24 12 y 8 6 4 2 0 15.0 20.0 25.0 30.0 35.0 1 2 3 4 5 6 7 8 x x d 280 210 140 y 80 60 40 20 0 120 160 200 240 y 14000 12000 10000 8000 f y 8 6 4 2 280 10 20 30 40 x x 1 2 3 4 5 6 7 8 x 0 1 2 3 4 5 6 7 8 x

Chapter 23 Investigating the relationship between two numerical variables 571 3 The amount of a particular pain relief drug given to each patient and the time taken for the patient to experience relief are shown. a b Patient 1 2 3 4 5 6 7 8 9 10 Drug dose (mg) 0.5 1.2 4.0 5.3 2.6 3.7 5.1 1.7 0.3 4.0 Response time (min) 65 35 15 10 22 16 10 18 70 20 Determine the value of Pearson s correlation coefficient. Classify the relationship between drug dose and response time according to the table given. 4 The proprietor of a hairdressing salon recorded the amount spent on advertising in the local paper and the business income for each month for a year, with the following results. Month Advertising ($) Business ($) 1 350 9 450 2 450 10 070 3 400 9 380 4 500 9 110 5 250 5 220 6 150 3 100 a b Month Advertising ($) Business ($) 7 350 8 060 8 300 7 030 9 550 11 500 10 600 12 870 11 550 10 560 12 450 9 850 Determine the value of Pearson s correlation coefficient. Classify the relationship between the amount spent on advertising and business income according to the table given. 5 The number of passenger seats on the most commonly used commercial aircraft, and the airspeeds of these aircraft, in km/h, are shown in the following table. a b Number of seats 405 296 288 258 240 230 193 188 Airspeed (km/h) 830 797 774 736 757 765 760 718 Number of seats 148 142 131 122 115 112 103 102 Airspeed (km/h) 683 666 661 378 605 620 576 603 Determine the value of Pearson s correlation coefficient. Classify the relationship between the number of passenger seats and airspeed according to the table given.

572 Essential Advanced General Mathematics 6 The price and age of several secondhand caravans are listed in the table. a b Age (years) Price ($) 7 4 800 7 3 900 8 4 275 9 3 900 4 6 900 8 6 500 1 11 400 Age (years) Price ($) 10 8 700 9 1 950 9 3 300 11 1 650 3 9 600 4 8 400 7 6 600 Determine the value of Pearson s correlation coefficient. Classify the relationship between price and age according to the table given. 7 The following are the scores for a group of ten students who each had two attempts at a test (out of 70). a b Attempt 1 53 56 57 49 44 69 66 40 53 43 68 64 Attempt 2 63 66 67 58 54 70 70 55 63 53 70 70 Construct a scatterplot of these data, and describe the relationship between scores on attempt 1 and attempt 2. Is it appropriate to calculate the value of Pearson s correlation coefficient for these data? Give reasons for your answer. 8 This table represents the results of two Student Test 1 Test 2 different tests for a group of students. 1 214 216 a Construct a scatterplot of these data, and 2 281 270 describe the relationship between scores on 3 212 281 Test 1 and Test 2. 4 324 326 b Is it appropriate to calculate the value of 5 240 243 Pearson s correlation coefficient for these 6 208 213 data? Give reasons for your answer. 7 303 311 c Determine the values of the q-correlation 8 278 290 coefficient and Pearson s correlation coefficient r. 9 311 320 d Classify the relationship between Test 1 and Test 2 using both the q-correlation coefficient and Pearson s correlation coefficient r, and compare. e It turns out that when the data was entered into the student records, the result for Test 2, Student 9 was entered as 32 instead of 320. i Recalculate the values of the q-correlation coefficient and Pearson s correlation coefficient r with this new data value. ii Compare these values with the ones calculated in c.

Chapter 23 Investigating the relationship between two numerical variables 573 23.4 Lines on scatterplots If a linear relationship exists between two variables it is possible to predict the value of the dependent variable from the value of the independent variable. The stronger the relationship between the two variables the better the prediction that is made. To make the prediction it is necessary to determine an equation which relates the variables and this is achieved by fitting a line to the data. Fitting a line to data is often referred to as regression, which comes from a Latin word regressum which means moved back. The simplest equation relating two variables x and y is a linear equation of the form y = a + bx where a and b are constants. This is similar to the general equation of a straight line, where a represents the coordinate of the point where the line crosses the y axis (the y axis intercept), and b represents the slope of the line. In order to summarise any particular (x, y) data set, numerical values for a and b are needed that will make the line pass close to the data. There are several ways in which the values of a and b can be found, of which the simplest is to find the straight line by placing a ruler on the scatter diagram, and drawing a line by eye, which appears to follow the general trend of the data. Example 6 The following table gives the gold medal winning distance, in metres, for the men s long jump for the Olympic games for the years 1896 to 1996. (Some years were missing owing to the two world wars.) Find a straight line which fits the general trend of the data, and use it to predict the winning distance in the year 2008. Year 1896 1900 1904 1908 1912 1920 1924 1928 1932 1936 1948 1952 1956 Distance (m) 6.35 7.19 7.34 7.49 7.59 7.16 7.44 7.75 7.65 8.05 7.82 7.57 7.82 Year 1960 1964 1968 1972 1976 1980 1984 1988 1992 1996 2000 2004 Distance (m) 8.13 8.08 8.92 8.26 8.36 8.53 8.53 8.72 8.67 8.50 8.55 8.59 Solution Distance 9.0 8.5 8.0 7.5 7.0 6.5 6.0 1900 1920 1940 1960 1980 2000 Year

574 Essential Advanced General Mathematics Note that this scatterplot does not start at the origin. Since the values of the coordinates that are of interest on both axes are a long way from zero, it is sensible to plot the graph for that range of values only. In fact, any values less than 1896 on the horizontal axis are meaningless in this context. The line shown on the scatterplot is only one of many which could be drawn. To enable the line to be used for prediction it is necessary to find its equation. To do this, first determine the coordinates of any two points through which it passes on the scatterplot. Appropriate points are (1932, 7.65) and (1976, 8.36). The equation of the straight line is then found by substituting in the formula which gives the equation for a straight line between two points. y y 1 = y 2 y 1 x x 1 x 2 x 1 y 7.65 8.36 7.65 = x 1932 1976 1932 = 0.71 44 = 0.016 y 7.65 = 0.016(x 1932) y = 0.016x 23.26 or distance = 23.26 + 0.016 year The intercept for this equation is 23.26 m. In theory, this is the winning distance for the year 0! In practice, there is no meaningful interpretation for the y axis intercept in this situation. But the same cannot be said about the slope. A slope of 0.016 means that on average the gold medal winning distance increases by about 1.6 centimetres at each successive games. Using this equation the gold medal winning distance for the long jump in 2008 would be predicted as y = 23.26 + 0.016 2008 = 8.87 m Obviously, attempting to project too far into the future may give us answers which are not sensible. When using an equation for prediction, derived from data, it is sensible to use values of the explanatory variable which are within a reasonable range of the data. The relationship between the variables may not be linear if we move too far from the known values.

Chapter 23 Investigating the relationship between two numerical variables 575 Example 7 The following table gives the alcohol consumption per head (in litres) and the hospital admission rate to each of the regions of Victoria in 1994 95. Per capita Hospital consumption admissions per Region (litres of alcohol) 1000 residents Loddon Mallee 9.0 42.0 Grampians 8.4 44.7 Barwon 8.7 38.6 Gippsland 9.1 44.7 Hume 10.0 41.0 Western Metropolitan 9.0 40.4 Northern Metropolitan 6.7 36.2 Eastern Metropolitan 6.2 32.3 Southern Metropolitan 8.1 43.0 Find a straight line which fits the general trend of the data, and interpret the intercept and slope. Solution Admissions 45.0 40.0 35.0 30.0 0 6.0 7.0 8.0 9.0 Alcohol consumption 10.0 One possible line passes through the points (7, 36) and (9, 42). y y 1 Thus = y 2 y 1 x x 1 x 2 x 1 y 36 42 36 = x 7 9 7 = 6 2 = 3 y 36 = 3(x 7) y = 3x + 15 or admission rate = 15 + 3 alcohol consumption The intercept for this equation is 15, implying that we predict a hospital admission rate would be 15 per 1000 residents for a region with 0 alcohol consumption. While this is interpretable, it would be a brave prediction as it is well out of the range of the

576 Essential Advanced General Mathematics Example 6 Example 7 data. A slope of 3 means that on average the admission rate rises by 3 per 1000 residents for each additional litre of alcohol consumed per capita. Exercise 23D 1 Plot the following set of data points on graph paper. x 0 1 2 3 4 5 6 7 8 y 1 3 6 7 7 11 13 18 17 Draw a straight line which fits the data by eye, and find an equation for this line. 2 Plot the following set of points on graph paper. x 3 2 1 0 1 2 3 4 y 5 2 0 6 7 11 13 20 Draw a straight line which fits the data by eye, and find an equation for this line. 3 The numbers of burglaries during two successive years for various districts in one state are given in the following table. a Make a scatterplot of the data. District Year 1 (x) Year 2 (y) b Find the equation of a straight line which A 3233 2709 relates the two variables. B 2363 2208 c Describe the trend in burglaries in this state. C 4591 3685 D 4317 4038 E 2474 2792 F 3679 3292 G 5016 4402 H 6234 5147 I 6350 5555 J 4072 4004 K 2137 1980 4 The following data give a girl s height (in cm) between the ages of 36 months and 60 months. Age (x) 36 40 44 52 56 60 Height (y) 84 87 90 92 94 96 a Make a scatterplot of the data. b Find the equation of a straight line which relates the two variables. c Interpret the intercept and slope, if appropriate. d Use your equation to estimate the girl s height at age i 42 months ii 18 years e How reliable are your answers to d?

Chapter 23 Investigating the relationship between two numerical variables 577 5 The following table gives the adult heights (in cm) of ten pairs of mothers and daughters. a b c Mother (x) 170 163 157 165 175 160 164 168 152 173 Daughter (y) 178 175 165 173 168 152 163 168 160 178 Make a scatterplot of the data. Find the equation of a straight line which relates the two variables. Estimate the adult height of a girl whose mother is 170 cm tall. 6 The manager of a company which manufactures MP3 players keeps a weekly record of the cost of running the business and the number of units produced. The figures for a period of eight weeks are shown in the table. a b c d Number of MP3 players produced (x) 100 160 80 100 220 150 170 200 Cost in 000 s $ (y) 2.5 3.3 2.4 2.6 4.1 3.1 3.5 3.8 Make a scatterplot of the data. Find the equation of a straight line which relates the two variables. What is the manufacturer s fixed cost for operating the business each week? What is the cost of production of each unit, over and above this fixed operating cost? 7 The amount of a particular pain relief drug given to each patient and the time taken for the patient to experience pain relief are shown. Patient 1 2 3 4 5 6 7 8 9 10 Drug dose (mg) 0.5 1.2 4.0 5.3 2.6 3.7 5.1 1.7 0.3 4.0 Response time (min) 65 35 15 10 22 16 10 18 70 20 a Find the equation of a straight line which relates the two variables. b Interpret the intercept and slope if appropriate. c Use your equation to predict the time taken for the patient to experience pain relief if 6 mg of the drug is given. Is this answer realistic? 8 The proprietor of a hairdressing salon recorded the amount spent on advertising in the local paper and the business income for each month for a year, with the results shown. a Find the equation of a straight line which relates the two variables. b Interpret the intercept and slope if appropriate. c Use your equation to predict the business income which would be attracted if the proprietor of the salon spent the following amounts on advertising: i $1000 ii $0 Month Advertising ($) Business ($) 1 350 9 450 2 450 10 070 3 400 9 380 4 500 9 110 5 250 5 220 6 150 3 100 7 350 8 060 8 300 7 030 9 550 11 500 10 600 12 870 11 550 10 560 12 450 9 850

578 Essential Advanced General Mathematics 23.5 The least squares regression line Fitting a line to a scatterplot by eye is not generally the best way of modelling a relationship. What is required is a method which uses a more objective criterion. A simple method is two divide the data set into two halves on the basis of the median x value, and to fit a line which passes through the mean x and y values of the lower half, and the mean x and y values of the upper half. This is called the two-mean line, and while easy to determine, it is not very widely used. The most common procedure is the method of least squares. The least squares regression line is the line for which the sum of squares of the vertical deviations from the data to the line (as indicated in the diagram) is a minimum. These deviations are called the residuals. y 35 30 25 20 15 10 5 0 1 2 3 4 (x i, y i ) 5 6 7 8 9 y = a + bx Consider the line y = a + bx We would like to find a and b such that the sum of the residuals is zero. That is, n (y i a bx i ) = 0 1 i=1 and the sum of residuals square is as small as possible. That is, n (y i a bx i ) 2 is a minimum 2 i=1 We will use the symbol S to denote From 1, n (y i a bx i ) 2 i=1 n (y i a bx i ) = 0 i=1 n y i na b i=1 n x i = 0 i=1 ȳ a b x = 0 a = ȳ b x 3 10 x

Chapter 23 Investigating the relationship between two numerical variables 579 Substituting this relationship in 2 n S = [y i (ȳ b x) bx i ] 2 = = i=1 n [(y i ȳ) b(x i x)] 2 i=1 n [(y i ȳ) 2 2b(x i x)(y i ȳ) + b 2 (x i x) 2 ] i=1 This can be thought of as a quadratic expression in b. In order to find the value of b which minimises S, wewill differentiate with respect to b and set the derivative equal to zero. Simplifying gives b = ds n db = 2 (x i x)(y i ȳ) + 2b i=1 = 0 n (x i x)(y i ȳ) i=1 4 n (x i x) 2 i=1 n (x i x) 2 Equations 3 and 4 can then be used to calculate the least squares estimates of the y axis intercept and the slope. Using the TI-Nspire The calculator can be used to construct the least squares regression line. The procedure is illustrated using the age and price of car data from Example 2. It has previously been illustrated how to enter the data as two lists, age and price respectively, and from these construct the scatterplot in a Data & Statistics application resulting in the data displayed as shown. i=1

580 Essential Advanced General Mathematics Now use Show Linear (mx + b) from the Regression submenu of the Analyze menu (b 461)toplace the regression line on the scatterplot as shown. Using the Casio ClassPad The following data gives the heights (in cm) and weights (in kg) of 11 people. Height (x) 177 182 167 178 173 184 162 169 164 170 180 Weight (y) 74 75 62 63 64 74 57 55 56 68 72 Enter the data into list1 (x) and list2 (y). Tap Calc, Linear Reg and select the settings shown. Tap OK to produce the results shown in the second screen. The format of the formula, y = ax + b is shown at the top and the values of a, b are shown in the answer box. TapOKtoproduce a scatterplot showing the regression line. Note: The formula can be automatically copied into a selected entry line if desired by selecting a graph number, e.g. y1 inthe Copy Formula line. After the equation of the least squares line has been determined, we can interpret the intercept and slope in terms of the problem at hand, and use the equation to make predictions. The method of least squares is also sensitive to any outliers in the data.

Chapter 23 Investigating the relationship between two numerical variables 581 Example 8 Consider again the gold medal winning distance, in metres, for the men s long jump for the Olympic games for the years 1896 to 2004. Find the equation of the least squares regression line for these data, and use it to predict the winning distance for the year 2008. Year 1896 1900 1904 1908 1912 1920 1924 1928 1932 1936 1948 1952 1956 Distance (m) 6.35 7.19 7.34 7.49 7.59 7.16 7.44 7.75 7.65 8.05 7.82 7.57 7.82 Year 1960 1964 1968 1972 1976 1980 1984 1988 1992 1996 2000 2004 Distance (m) 8.13 8.08 8.92 8.26 8.36 8.53 8.53 8.72 8.67 8.50 8.55 8.59 Solution Using a calculator or computer the equation is found to be distance = 23.87 + 0.0163 year which is quite similar to the equation to the line fitted by eye. The predicted distance for the year 2008 is Example 9 distance = 23.87 + 0.0163 2008 = 8.86 m Consider again the data from Example 7 which related alcohol consumption per head (in litres) and the hospital admission rate to each of the regions of Victoria in 1994 95. Per capita consumption Hospital admissions Region (litres of alcohol) per 1000 residents Loddon Mallee 9.0 42.0 Grampians 8.4 44.7 Barwon 8.7 38.6 Gippsland 9.1 44.7 Hume 10.0 41.0 Western Metropolitan 9.0 40.4 Northern Metropolitan 6.7 36.2 Eastern Metropolitan 6.2 32.3 Southern Metropolitan 8.1 43.0 Find the equation of the least squares regression line which fits these data. Solution Using a calculator or computer the equation is found to be admissions = 19.9 + 2.45 alcohol which is slightly different from the line fitted by eye.

582 Essential Advanced General Mathematics Example 8 Correlation and causation The existence of even a strong linear relationship between two variables is not, in itself, sufficient to imply that altering one variable causes a change in the other. It only implies that this might be the explanation. It may be that both the measured variables are affected by a third and different variable. For example, if data about crime rates and unemployment in a range of cities were gathered, a high correlation would be found. But could it be inferred that high unemployment causes a high crime rate? The explanation could be that both of these variables are dependent on other factors, such as home circumstances, peer group pressure, level of education or economic conditions, all of which may be related to both unemployment and crime rates. These two variables may vary together, without one being the direct cause of the other. Exercise 23E 1 The following data give a girl s height (in cm) between the ages of 36 months and 60 months. Age (x) 36 40 44 52 56 60 Height (y) 84 87 90 92 94 96 a Using the method of least squares find the equation of a straight line which relates the two variables. b Interpret the intercept and slope, if appropriate. c Use your equation to estimate the girl s height at age i 42 months ii 18 years d How reliable are your answers to part c? 2 The number of burglaries during two successive years for various districts in one state are given in the following table. Using the method of least squares find the equation of a straight line which relates the two variables. District Year 1 (x) Year 2 (y) A 3 233 2 709 B 2 363 2 208 C 4 591 3 685 D 4 317 4 038 E 2 474 2 792 F 3 679 3 292 G 5 016 4 402 H 6 234 5 147 I 6 350 5 555 J 4 072 4 004 K 2 137 1 980

Example 9 Chapter 23 Investigating the relationship between two numerical variables 583 3 The following table gives the adult heights (in cm) of ten pairs of mothers and daughters. a b c Mother (x) 170 163 157 165 175 160 164 168 152 173 Daughter (y) 178 175 165 173 168 152 163 168 160 178 Using the method of least squares find the equation of a straight line which relates the two variables. Interpret the slope in this context. Estimate the adult height of a girl whose mother is 170 cm tall. 4 The manager of a company which manufactures MP3 players keeps a weekly record of the cost of running the business and the number of units produced. The figures for a period of eight weeks are: Number of MP3 players produced (x) 100 160 80 100 220 150 170 200 Cost in 000 s $ (y) 2.5 3.3 2.4 2.6 4.1 3.1 3.5 3.8 a Using the method of least squares find the equation of a straight line which relates the two variables. b What is the manufacture s fixed cost for operating the business each week? c What is the cost of production of each unit, over and above this fixed operating cost? 5 The amount of a particular pain relief drug given to each patient and the time taken for the patient to experience pain relief are shown. a b c Patient 1 2 3 4 5 6 7 8 9 10 Drug dose (mg) 0.5 1.2 4.0 5.3 2.6 3.7 5.1 1.7 0.3 4.0 Response time (min) 65 35 15 10 22 16 10 18 70 20 Using the method of least squares find the equation of a straight line which relates the two variables. Interpret the intercept and slope if appropriate. Use your equation to predict the time taken for the patient to experience pain relief if 6 mg of the drug is given. Is this answer realistic? 6 The proprietor of a hairdressing salon recorded the amount spent on advertising in the local paper and the business income for each month for a year, with the results shown. a Using the method of least squares find the equation of a straight line which relates the two variables. b Interpret the intercept and slope if appropriate. Month Advertising ($) Business ($) 1 350 9 450 2 450 10 070 3 400 9 380 4 500 9 110 5 250 5 220 6 150 3 100 7 350 8 060 8 300 7 030 9 550 11 500 10 600 12 870 11 550 10 560 12 450 9 850

584 Essential Advanced General Mathematics c Use your equation to predict the volume which would be attracted if the proprietor of the salon spent the following amounts on advertising. i $1000 ii $0 Using a CAS calculator with statistics II How to construct a scatterplot Construct a scatterplot for the following set of test scores. Treat Test 1 as the independent (x-) variable. Test 1 score 10 18 13 6 8 5 12 15 15 Test 2 score 12 20 11 9 6 6 12 13 17 Enter the data into your calculator using the Stats/List Editor. Type the data into list1 and list2. Setup the calculator to plot a statistical graph. a Press F2 to access the Plots menu. b Press ENTER to select 1:Plot Setup. c Press F1 to Define Plot1. d Complete the dialogue box as follows: For Plot Type: select 1:Scatter Leave Mark:asBox. For x, type in list1. For y, type in list2. Pressing ENTER confirms your selection and returns you tothe Plot Setup menu. Pressing F5 (Zoom Data) in Plot Setup automatically plots the scatter plot in a properly scaled window.

Chapter 23 Investigating the relationship between two numerical variables 585 How to calculate the correlation coefficient r Use a calculator to calculate the correlation coefficient r for the following data. x 1 3 5 4 7 y 2 5 7 2 9 Give the answer correct to two decimal places. Enter the data into your calculator using the Stats/List Editor. Type the data into list1 and list2. Calculate the correlation coefficient. a Press F4 to access the Calculate menu. b Move the cursor down ( )toregressions and across ( )tothe Regressions menu. c Press ENTER to select 1: LinReg(a+bx). This will take you to the LinReg(a+bx) dialogue box. d Complete the dialogue box. For Xlist: type in list1. For Ylist: type in list2. e Press ENTER to obtain the results. How to determine the equation of a least squares regression The following data gives the heights (in cm) and weights (in kg) of 11 people. Height (x) 177 182 167 178 173 184 162 169 164 170 180 Weight (y) 74 75 62 63 64 74 57 55 56 68 72 Use a graphics calculator to determine the equation of the least squares regression line that will enable weight to be predicted from height. Enter the data into your calculator using the Stats/List Editor.

586 Essential Advanced General Mathematics Type the data into list1 and list2. a Press F4 to access the Calculate menu. b Move the cursor down (D)toRegressions and across (B)tothe Regressions menu. c Press ENTER to select 1: LinReg(a+bx). This will take you to the LinReg(a+bx) dialogue box. d Complete the dialogue box. For Xlist: type in list1. For Ylist: type in list2. e Press ENTER to obtain the results.

Chapter 23 Investigating the relationship between two numerical variables 587 Chapter summary Bivariate data arises when measurements on two variables are collected for each subject. A scatterplot is an appropriate visual display of bivariate data if both of the variables are numerical. A scatterplot of the data should always be constructed to assist in the identification of outliers and illustrate the association (positive, negative or none). Twovariables are positively associated when larger values of y are associated with larger values of x.two variables are negatively associated when larger values of y are associated with smaller values of x. There is no association between two variables when the values of y are not related to the values of x. When constructing the scatterplot, the independent or explanatory variable is plotted on the horizontal (x) axis, and the dependent or response variable is plotted on the vertical (y) axis. If a linear relationship is indicated by the scatterplot a measure of its strength can be found by calculating the q-correlation coefficient, orpearson s product-moment correlation coefficient, r. If the values on a scatterplot are divided by lines representing the median of x and the median of y into four quadrants A, B, C and D, with a, b, c, d representing the number of points in each quadrant respectively, then the q-correlation coefficient is given by (a + c) (b + d) q = a + b + c + d Pearson s product-moment correlation, r, isameasure of strength of linear relationship between two variables, x and y.ifwehaven observations then for this set of observations r = 1 n ( )( ) xi x yi ȳ n 1 i=1 s x s y where x and s x are the mean and standard deviation of the x scores and ȳ and s y are the mean and standard deviation of the y scores. For these correlation coefficients 1 q 1 1 r 1 with values close to ±1 indicating strong correlation, and those close to 0 indicating little correlation. If a linear relationship is indicated from the scatterplot a straight line may be fitted to the data, either by eye or using the least squares regression method. The least squares regression line is the line for which the sum of squares of the vertical deviations from the data to the line is a minimum. The value of the slope (b)gives the extent of the change in the dependent variable associated with a unit change in the independent variable. Once found, the equation to the straight line may be used to predict values of the response variable (y) from the explanatory variable (x). The accuracy of the prediction depends on how closely the straight line fits the data, and an indication of this can be obtained from the correlation coefficient. Review

588 Essential Advanced General Mathematics Review Multiple-choice questions 1 Forwhich one of the following pairs of variables would it be appropriate to construct a scatterplot? A eye colour (blue, green, brown, other) and hair colour (black, brown, blonde, red, other) B score out of 100 on a test for a group of Year 9 students and a group of Year 11 students C political party preference (Labor, Liberal, Other) and age in years D age in years and blood pressure in mm Hg E height in cm and gender (male, female) 2 Forwhich one of the following plots would it be appropriate to calculate the value of the q-correlation coefficient? A B C D E 3 A q-correlation coefficient of 0.32 would describe a relationship classified as A weak positive B moderate positive C strong positive D close to zero E moderately strong 4 The scatterplot shows the relationship between age and the number of alcoholic drinks consumed on the weekend by a group of people. The value of the q-correlation coefficient is closest to A 1 B 7 9 No. of drinks 20 15 10 5 C 5 6 D 7 9 E 1 0 10 20 30 40 50 60 70 Age

Chapter 23 Investigating the relationship between two numerical variables 589 5 The following scatterplot shows the relationship between height and weight for a group of people. The value of the Pearson s product-moment correlation coefficient r is closest to A 1 B 0.8 C 0.5 D 0.3 E 0 Weight 220 200 180 160 140 120 100 80 140 150 160 170 180 190 Height (cm) Questions 6 and 7 relate to the following information. The weekly income and weekly expenditure on food for a group of 10 university students is given in the following table. 200 Weekly income ($) 150 250 300 600 300 380 950 450 850 1000 Weekly food expenditure ($) 40 60 70 120 130 150 200 260 460 600 6 The value of the Pearson product-moment correlation coefficient r for these data is closest to A 0.2 B 0.4 C 0.6 D 0.7 E 0.8 7 The least squares regression line which would enable expenditure on food to be predicted from weekly income is closest to A 0.482 + 42.864 weekly income B 0.482 42.864 weekly income C 42.864 + 0.482 weekly income D 239.868 + 1.355 weekly income E 1.355 + 239.868 weekly income Review

590 Essential Advanced General Mathematics Review Questions 8 and 9 relate to the following information. Suppose that the least squares regression line which would enable expenditure on entertainment (in dollars) to be predicted from weekly income is given by Weekly expenditure on entertainment = 40 + 0.10 Weekly income 8 Using this rule the expenditure on entertainment by an individual with an income of $600 per week is predicted to be A $40 B $24 060 C $100 D $46 E $240 9 From this rule which of the following statements is correct? A On average for each extra dollar of income an extra 10 cents is spent on entertainment B On average for each extra 10 cents in income an extra $1 is spent on entertainment C On average for each extra dollar of income an extra 40 cents is spent on entertainment D On average people spend $40 per week on entertainment E On average people spend $50 per week on entertainment 10 For the scatterplot shown the line of best fit would have a slope closest to: 250 A 0.1 B 0.1 C 10 D 10 E 200 Short-answer questions 200 150 100 50 10 12 14 Technology is required to answer some of the following questions. 1 The following table gives the number of times Inside 50 Score (points) the ball was inside the 50 metre line in an AFL 64 90 football game, and the team s score in that game 57 134 a Plot the score against the number of Inside 50s. 34 76 b From the scatterplot, describe any association 61 92 between the two variables. 51 93 2 Use the scatterplot constructed in 1 to determine 52 45 q-correlation between the score and the number of 53 120 Inside 50s. 51 66 64 105 55 108 58 88 71 133 16 18 20 22

Chapter 23 Investigating the relationship between two numerical variables 591 3 The distance traveled to work and the time taken for a group of company employees are given in the following table. Determine the value of the Pearson product-moment correlation r for these data. Distance (kms) 12 50 40 25 45 20 10 3 10 30 Time (mins) 15 75 50 50 80 50 10 5 10 35 4 The following scatterplot shows the relationship between height and weight for a group of people. Draw a straight line which fits the data by eye, and find an equation for this line. weight 220 200 180 160 140 120 100 80 140 150 160 170 180 190 200 height (cm) 5 The time taken to complete a task, and the number of errors on the task, were recorded for a sample of 10 primary school children. Determine the equation of the least squares regression line which fits these data. 6 For the data in 5: a Interpret the intercept and slope of the least squares regression line. b Use the least squares regression line to predict the number of errors which would be observed for a child who took 10 seconds to complete the task. Time (seconds) Errors 22.6 2 21.7 3 21.7 3 21.3 4 19.3 5 17.6 5 17.0 7 14.6 7 14.0 9 8.8 9 Review

592 Essential Advanced General Mathematics Review Extended-response questions 1 A marketing company wishes to predict the likely number of new clients each of its graduates will attract to the business in their first year of employment, by using their scores on a marketing exam in the final year of their course. a Which is the independent variable Number of new and which is the dependent variable? Exam score clients b Construct a scatterplot of these data. 65 7 c Describe the association between the 72 9 Number of new clients and Exam 68 8 score. 85 10 d Determine the value of the q-correlation 74 10 coefficient for these data, and classify 61 8 the strength of the relationship. 60 6 e Determine the value of the Pearson 78 10 product-moment correlation coefficient 70 5 for these data and classify the strength of the relationship. 82 11 f Determine the equation for the least squares regression line and write it down in terms of the variables Number of new clients and Exam score. g Interpret the intercept and slope of the least squares regression line in terms of the variables in the study. h Use your regression equation to predict to the nearest whole number the Number of new clients for a person who scored 100 on the exam. i How reliable is the prediction made in h? 2 To investigate the relationship between marks on an assignment and the final examination mark a sample of 10 students was taken. The table indicates the marks for the assignment and the final exam mark for each individual student. a Which is the independent variable Assignment mark Final exam mark and which is the dependent variable? (max = 80) (max = 90) b Construct a scatterplot of these data. 80 83 c Describe the association between the 77 83 assignment mark and exam mark. 71 79 d Determine the value of the q-correlation 78 75 coefficient for these data, and classify the 65 68 strength of the relationship. 80 84 e Determine the value of the Pearson 68 71 product-moment correlation coefficient 64 69 for these data and classify the strength of 50 66 the relationship. 66 58

Chapter 23 Investigating the relationship between two numerical variables 593 f Use your answer to d to comment on the statement: Good final exam marks are the result of good assignment marks. g Determine the equation for the least squares regression line and write it down in terms of the variables Final exam mark and Assignment mark. h Interpret the intercept and slope of the least squares regression line in terms of the variables in the study. i Use your regression equation to predict the Final exam mark for a student who scored 50 on the assignment. j How reliable is the prediction made in i? 3 A marketing firm wanted to investigate the relationship between airplay and CD sales (in the following week) of newly released songs. Data was collected on a random sample of 10 songs. a Which is the independent variable and which No. of times the Weekly sales is the dependent variable? song was played of the CD b Construct a scatterplot of these data. 47 3950 c Describe the association between the number 34 2500 of times the song was played and weekly sales. 40 3700 d Determine the value of the q-correlation 34 2800 coefficient for these data, and classify the strength 33 2900 of the relationship. 50 3750 e Determine the value of the Pearson 28 2300 product-moment correlation coefficient for these 53 4400 data and classify the strength of the relationship. 25 2200 f Determine the equation for the least squares regression line and write it down in terms of the 46 3400 variables Number of times the song was played and Weekly sales. g Interpret the intercept and slope of the least squares regression line in terms of the variables in the study. h Use your regression equation to predict the weekly sales for a song which was played 60 times. i How reliable is the prediction made in h? Review