Stat 20 Midterm 1 Review

Stat 20 Midterm Review
February 7, 2007

This handout is intended to be a comprehensive study guide for the first Stat 20 midterm exam. I have tried to cover all the course material in a way that targets what you need to know to do well on the test. If you understand everything in this handout, then you have a great shot at getting an A. However, I cannot guarantee that I've covered everything that might be asked.

In preparing for the exam, I suggest you look at the section headings and decide which topics you know well and which ones need more attention. Within each section, I've tried to include examples of problems that rely upon understanding the concepts and formulas. These are the easy problems. At the end of the handout I'm posting more difficult problems that are more representative of the harder problems that might be on the exam. The main difference between the easy and hard problems is that more difficult problems usually require you to do several steps that may encompass more than one concept or formula. Each of the steps is not so bad on its own, but sometimes it can be difficult to put them all together.

Learning statistics is kind of like building a house: it takes some time to put the foundation in place before you can think about the walls and roof. The best way to prepare for the exam is to practice the easy problems, build your way up to the harder ones, and then improve your speed through repetition.

Also, it's a good idea to put some time and effort into creating the page of notes you're allowed to bring in (if this is allowed). Do this as soon as possible so that you can practice problems with your formula sheet. A good place to start in preparing your notes is with the numbered formulas in this handout. However, you need to make sure you know when to use the formulas! It might be beneficial to make a list of what you need to know to use each formula and what you can find out from it.
Also, I should point out that in some places I've provided explanations and/or short proofs to show you where these rules are coming from. You don't need to know how to prove the formulas for the exam, only how and when to use them. However, you may still find it useful to read through some of these proofs. All of these explanations rely upon things you've learned in the class, so it may help you solidify your understanding. Anyway, feel free to concentrate on the examples for now, but since your book doesn't always go into the details, I want you to have this information available in case you're interested. Good luck on the midterm!

1 Data Basics and Histograms

Many quantities in the world around us are random. When we want to study a particular population, we are often interested in the distribution of values within that population. A good example of this is age. At any given time, there is a certain proportion of children, teenagers, young adults, middle aged people, and elderly people. Occasionally someone lives to be 100 and then we get to hear about it on the Today Show. In order to determine how rare such an event is, we can represent a data set's distribution visually using a histogram.

A histogram splits the population into a number of groups, much as I did for age ranges above, and then provides a bar graph showing the size of each group. One way to do this is to just make the height of each bar equal to the number of people within the group. However, when the groups have different widths, a better way to compare groups is by drawing the histogram on the density scale. To convert data to the density scale, divide the total population within a group by the length of the group's interval. The main formulas you'll need are:

$$\text{Density within class interval} = \frac{\text{Area in class interval}}{\text{Length of class interval}} \tag{1}$$

$$\text{Percentage of data within a class interval} = \frac{\text{Area of class interval}}{\text{Total area}} \tag{2}$$

Teenagers are usually defined as people with age X in [13, 20). If there are 7 million teenagers in the population, then this group has a density of 1 million per year. We usually then go on to convert from absolute density to relative density. If the overall population of the country is 100 million, then each one-year interval within the teenager subgroup contains 1% of the overall population. By drawing the histogram according to this density, we can immediately compare the size of subgroups according to which has the larger area, and we can say that one subgroup is more crowded than the other if its height is higher.

Once we have a histogram on the density scale, we can define the qth percentile as the point x such that q percent of the data points are less than x. For instance, in the above teenager example, suppose that there are 3 million children with age X in [0, 13) and 7 million teenagers with age X in [13, 20). So a total of 10 million people out of the population of 100 million are under the age of 20, which means that the age of 20 years is the 10th percentile of age.

Let's also review some other histogram examples.

Example of Histograms:

Age (years)    Total Population (Millions)
0-10           20
10-20          10
20-40          30
40-70          15
70-95          25

The above table depicts the age distribution for the population inhabiting the island of Statististakia. Each group includes the left end-point but not the right. (So, for instance, the first group includes people between the ages of 0 and 1, but does not include anyone 10 or older.) Draw a histogram on the density scale of the age distribution in Statististakia. Don't forget to label its axes!

Please see Figure 1 for the solution.

[Figure 1: Density Scale Histogram for Population in Statististakia]

We can then go on to count proportions within various subgroups. For example, let's consider Questions 2-3 from Quiz 1, which continue with the previous population data:

Example: Histogram Group Comparisons. Would you estimate that there are more 2 year olds (people with 2 ≤ age < 3) than 71 year olds (people with 71 ≤ age < 72) on the island of Statististakia?

Because the total population is 100 million, each million people is 1% of the population.

Estimated number of 2 year olds: $\frac{20\%}{10 - 0} = 2\%$.
Estimated number of 71 year olds: $\frac{25\%}{95 - 70} = 1\%$.

We estimate that there are more 2 year olds than 71 year olds in Statististakia. These estimates assume that the population within each age group in the table is evenly distributed. In reality, the number of 2 year olds and 71 year olds may be somewhat more or less than the estimated percentages.

The Statististakians want to pair young people between the ages of 12 and 15 (excluding the right endpoint, so 12 ≤ age < 15) with middle-aged people between the ages of 42 and 46 (excluding the right endpoint, so 42 ≤ age < 46) for a mentoring program. About what percentage of people in Statististakia are eligible to participate?

People 12-15: $\frac{(15 - 12) \cdot 10\%}{20 - 10} = \frac{3 \cdot 10\%}{10} = 3\%$.
People 42-46: $\frac{(46 - 42) \cdot 15\%}{70 - 40} = \frac{4 \cdot 15\%}{30} = 2\%$.

The estimated percentage of the population that is eligible for the program is 3% + 2% = 5%.

Speed (mph)    Percentage of Total Cars
0-10           20
10-20          10
20-40          40
40-80          20
80-95          10

The above table depicts data collected in a (hypothetical) survey studying the distribution of traffic speed on the Bay Bridge. Each group includes the left end-point but not the right. (So, for instance, the first group includes cars driving 0 mph (not moving) but does not include cars driving exactly 10 mph.) Use this information to answer Questions 1-3. You may draw a histogram of these data if it helps you, but it is not necessary for answering the questions.

Example: Histogram comparisons. Would you estimate that there are more cars driving less than 1 mph than cars driving between 21 and 22 mph? Note: the interval is [21, 22).

For any class interval, we estimate the proportion of data falling in one of its sub-intervals by the equation

$$\text{Percent in sub-interval} = \text{Percent in class} \times \frac{\text{Length of sub-interval}}{\text{Length of class interval}}.$$

Based upon this, the percent of data in [0, 1) is approximately $20\% \times \frac{1 - 0}{10 - 0} = 2\%$. Likewise, the percent of data in [21, 22) is approximately $40\% \times \frac{22 - 21}{40 - 20} = 2\%$. Therefore, we estimate that both sub-intervals contain the same percentage of all cars on the Bay Bridge. As a result, we must answer no, we would not estimate that there are more cars driving less than 1 mph than cars driving between 21 and 22 mph.

Example: Adding block components. Approximately what percentage of cars are driving between 30 and 50 mph? The interval is [30, 50).

The approximate percentage of cars in [30, 50) is given by the approximate percentage of cars in [30, 40) plus the approximate percentage of cars in [40, 50). These figures are:

Approximate percentage in [30, 40) = $\frac{40 - 30}{40 - 20}(40\%) = 20\%$.
Approximate percentage in [40, 50) = $\frac{50 - 40}{80 - 40}(20\%) = 5\%$.

Combining these values, we estimate that 20% + 5% = 25% of all cars are driving between 30 and 50 mph.
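The sub-interval estimates above all apply the same rule, so they can be sketched in a few lines of Python. This is only an illustration under the same uniformity assumption used in the text. The function name `percent_in_range` and the list `speed_classes` are made up for this sketch, and the numbers are the ones from the Bay Bridge speed table.

```python
# Each class is (left endpoint, right endpoint, percent of all cars).
# Intervals include the left endpoint but not the right, as in the table.
speed_classes = [(0, 10, 20), (10, 20, 10), (20, 40, 40), (40, 80, 20), (80, 95, 10)]


def percent_in_range(classes, lo, hi):
    """Estimate the percent of data in [lo, hi), assuming the data are
    spread evenly within each class interval (hypothetical helper)."""
    total = 0.0
    for left, right, percent in classes:
        overlap = min(hi, right) - max(lo, left)
        if overlap > 0:
            # Percent in class * (length of sub-interval / length of class interval)
            total += percent * overlap / (right - left)
    return total


print(percent_in_range(speed_classes, 0, 1))    # cars under 1 mph
print(percent_in_range(speed_classes, 21, 22))  # cars in [21, 22)
print(percent_in_range(speed_classes, 30, 50))  # cars in [30, 50), which spans two classes
```

With the table as given, the three calls should reproduce the hand calculations of 2%, 2%, and 25%.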

Example: Change of Units. Now suppose the speeds in the table are converted to kilometers per hour (with 1.6 km = 1 mile). What is the new density of the [80, 95) class after its units are converted from mph to kmph?

In units of mph, the [80, 95) class includes 10% of the data, and the length of the interval is 95 − 80 = 15 units of mph. By converting to km, the length of this interval becomes 15 mph × 1.6 kmph/mph = 24 kmph. The area is still 10%, so under the uniformity assumption, we distribute this area evenly over the class. As a result, each of the 24 one-unit sub-intervals within the class receives 10%/24 ≈ 0.4167%. This means that a proportion of approximately 0.4167% (or 0.004167 in decimal) of all cars are driving in any one-km-unit block of speed within the class interval that originally consisted of [80, 95) in terms of mph.

Example: Combining Blocks and Calculating Percentiles. The traffic committee decides that it would be more accurate to combine the [10, 20) class with the [20, 40) class to form a [10, 40) class in the histogram. Based upon this update, estimate the 55th percentile of speed. Don't forget to include the proper units in your answer!

This problem involves two steps. First, we must combine the [10, 20) class interval with the [20, 40) class interval to form a single [10, 40) class interval. When we do so, the percentage of total cars in this interval becomes 10% + 40% = 50%, and the length of the interval becomes 40 − 10 = 30, so the new density of the class interval is 50%/30 = 5/3 % per mph. Having done this, we must then recognize that 20% of the total cars are driving less than 10 mph, and a total of 20% + 50% = 70% of the total cars are driving less than 40 mph. We define the 55th percentile as the speed at which 55% of all cars are driving less than that speed. Because speed is a continuous variable, we can estimate percentiles on the histogram by assuming that the data are evenly distributed within each class interval.
We have determined that the 20th percentile is 10 mph, and also that the 70th percentile is 40 mph. Therefore, the 55th percentile lies between these values. Starting from the left, we require 35% of area to come from the interval [10, 10 + x). We can solve for x as follows:

$$x = \frac{35\%}{50\%}(40\text{ mph} - 10\text{ mph}) = 21\text{ mph}.$$

This is the number of mph over 10, so the 55th percentile is 10 mph + x = 10 mph + 21 mph = 31 mph.

2 Means, Standard Deviations, and Standard Units

2.1 Definitions

Suppose you have data $X_1, \ldots, X_n$. We can define the mean, standard deviation, and standard units measures as follows:

Mean:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{n}(X_1 + X_2 + \cdots + X_n) \tag{3}$$

St. Dev:
$$\mathrm{SD}(X) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2} = \sqrt{\left[\frac{1}{n}\sum_{i=1}^{n} X_i^2\right] - (\bar{X})^2} \tag{4}$$

For the standard deviation, the first equation is the definition, and the second equation is a computing formula that can sometimes be easier to use. Either equation is fine to use, and we proved in Quiz 2 that they are equivalent. Note that the standard deviation is never negative; we're taking the positive square root. Also remember that a standard deviation of zero means that your data is constant (i.e. not random).

St. Units:
$$\mathrm{SU}(X_i) = \frac{X_i - \bar{X}}{\mathrm{SD}(X)} \tag{5}$$

Let's run through a quick example. Suppose you have data $(X_1, X_2, X_3, X_4, X_5) = (13, 15, 16, 17, 19)$. We have n = 5 data points. Then

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{5}(13 + 15 + 16 + 17 + 19) = \frac{80}{5} = 16.$$

$$\mathrm{SD}(X) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2} = \sqrt{\frac{1}{5}\left[(13-16)^2 + (15-16)^2 + (16-16)^2 + (17-16)^2 + (19-16)^2\right]} = \sqrt{\frac{1}{5}(9 + 1 + 0 + 1 + 9)} = \sqrt{\frac{20}{5}} = \sqrt{4} = 2.$$

$$\mathrm{SU}(X_1) = \frac{X_1 - \bar{X}}{\mathrm{SD}(X)} = \frac{13 - 16}{2} = -1.5.$$

Likewise, you can verify that $\mathrm{SU}(X_2) = -0.5$, $\mathrm{SU}(X_3) = 0$, $\mathrm{SU}(X_4) = 0.5$, and $\mathrm{SU}(X_5) = 1.5$.

We like to use standard units for two reasons: first, it immediately tells us how many standard deviations from the mean a data point is, and second, it is a first step in using the Normal Distribution to estimate percentiles.

2.2 Linear Transformations of Data

Suppose we have data $(X_1, \ldots, X_n)$, and we're actually interested in a linear function of this data. A linear function always has the form $Y_i = aX_i + b$, where a and b are fixed real numbers. Then, if we know $\bar{X}$ and $\mathrm{SD}(X)$, we can quickly compute $\bar{Y}$ and $\mathrm{SD}(Y)$ by the following formulas:

$$\bar{Y} = a\bar{X} + b \quad \text{if } Y_i = aX_i + b \text{ for } 1 \le i \le n \tag{6}$$

$$\mathrm{SD}(Y) = |a|\,\mathrm{SD}(X) \quad \text{if } Y_i = aX_i + b \text{ for } 1 \le i \le n \tag{7}$$

I'm going to provide a quick proof of these equations. We only expect you to be able to use them when it's appropriate, but going through the proofs provides good examples of means and standard deviations and might help you understand what's going on. It's OK if you just want to skip to the example.

Proof of Equation 6:

$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i = \frac{1}{n}\sum_{i=1}^{n}[aX_i + b] = \frac{1}{n}\left[\left(\sum_{i=1}^{n} aX_i\right) + \sum_{i=1}^{n} b\right] = \frac{1}{n}\left[\left(a\sum_{i=1}^{n} X_i\right) + nb\right] = a\,\frac{1}{n}\sum_{i=1}^{n} X_i + \frac{1}{n}(nb) = a\bar{X} + b.$$

Proof of Equation 7:

$$\mathrm{SD}(Y) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left([aX_i + b] - [a\bar{X} + b]\right)^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(aX_i - a\bar{X} + b - b\right)^2}$$

$$= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(a[X_i - \bar{X}]\right)^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} a^2(X_i - \bar{X})^2} = \sqrt{a^2\,\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2} = |a|\sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2} = |a|\,\mathrm{SD}(X).$$

Notice that this result does not depend on b. This means that adding a constant to every data point will not affect the standard deviation.

Example of Linear Transformations: You collect data on children's height at age 10 and find that the average height is $\bar{X} = 40$ inches with a standard deviation of $\mathrm{SD}(X) = 4$ inches. However, the doctor you're working for has two complaints with the data: first of all, he wanted the data in centimeters, and secondly, the measuring stick was found to be two inches shorter than it's supposed to be. Provide the mean and SD of the children's heights in centimeters.

To solve this problem, we first need to create a linear transformation of the data that accounts for the doctor's complaints. Remember that 1 inch = 2.54 centimeters, and we need to add two inches to each child's height. Therefore, we have the equation:

$$Y_i = 2.54(X_i + 2) = 2.54X_i + 5.08.$$

So a = 2.54 and b = 5.08. Then

$\bar{Y} = a\bar{X} + b = 2.54(40) + 5.08 = 106.68$ centimeters.
$\mathrm{SD}(Y) = |a|\,\mathrm{SD}(X) = 2.54(4) = 10.16$ centimeters.
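The definitions and the linear-transformation rules above can be checked numerically. The following Python sketch is only an illustration, with made-up helper names (`mean`, `sd`, `standard_units`, `transformed_mean_sd`). It uses the 1/n standard deviation from Equation 4, the five-point data set 13, 15, 16, 17, 19 from the worked example, and the summary numbers from the height example.

```python
import math


def mean(data):
    """Average of the data, as in Equation 3."""
    return sum(data) / len(data)


def sd(data):
    """Standard deviation with a 1/n divisor, as in Equation 4."""
    m = mean(data)
    return math.sqrt(sum((x - m) ** 2 for x in data) / len(data))


def standard_units(data):
    """Each value minus the mean, divided by the SD (Equation 5)."""
    m, s = mean(data), sd(data)
    return [(x - m) / s for x in data]


def transformed_mean_sd(x_mean, x_sd, a, b):
    """Mean and SD of Y = a*X + b from the mean and SD of X (Equations 6 and 7)."""
    return a * x_mean + b, abs(a) * x_sd


data = [13, 15, 16, 17, 19]
print(mean(data), sd(data))   # mean and SD of the worked example
print(standard_units(data))   # the same data in standard units

# Height example: inches to centimeters after adding 2 inches.
print(transformed_mean_sd(40, 4, 2.54, 5.08))

# Cross-check Equations 6 and 7 by transforming the raw data directly.
a, b = 2.54, 5.08
y = [a * x + b for x in data]
print(mean(y), a * mean(data) + b)
print(sd(y), abs(a) * sd(data))
```

The last two lines print each quantity twice, once computed from the transformed data and once from the shortcut formulas, so the pairs should agree up to floating-point rounding.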

[Figure 2: The Standard Normal Curve. Vertical axis: density P(z); horizontal axis: z in standard units; regions labeled A, B, and C.] This curve is divided into three regions. A is the region of values less than −z, B is the interval [−z, z], and C is the region of values greater than z. Because the Normal curve is symmetric, the area in A is always equal to the area in C for any value of z.

Standard Units as a Linear Transformation: You collect data $X_1, \ldots, X_n$, and calculate $\bar{X}$ and $\mathrm{SD}(X)$. What are the mean and standard deviation of the data in standard units?

We convert to standard units via the formula

$$Y_i = \mathrm{SU}(X_i) = \frac{X_i - \bar{X}}{\mathrm{SD}(X)} = \frac{1}{\mathrm{SD}(X)}X_i - \frac{\bar{X}}{\mathrm{SD}(X)}.$$

We have $a = \frac{1}{\mathrm{SD}(X)}$ and $b = -\frac{\bar{X}}{\mathrm{SD}(X)}$. Then

$$\bar{Y} = a\bar{X} + b = \frac{\bar{X}}{\mathrm{SD}(X)} - \frac{\bar{X}}{\mathrm{SD}(X)} = 0, \qquad \mathrm{SD}(Y) = |a|\,\mathrm{SD}(X) = \frac{\mathrm{SD}(X)}{\mathrm{SD}(X)} = 1.$$

Remember that the standard deviation is never negative, so the absolute value goes away.

3 The Normal Distribution

Histograms are a nice way to represent a discrete data set's distribution, but we're often interested in studying the distribution of continuous data. One way to represent a distribution of continuous data is to start by putting data within particular intervals into a group, then drawing a histogram on the density scale for the grouped data. If we make the interval sizes smaller and smaller, eventually we'll end up with a histogram that looks like a curve. This curve can look like practically anything, because all we can say about the data's distribution is that the area under the curve is 1, and the density at any point cannot be negative.

The Normal Distribution is particularly important because many types of continuous data are approximately Normal. The Normal distribution allows the data to take any value on the line of real numbers from −∞ to ∞. A Normal distribution may have any mean in this range and any standard deviation greater than 0. However, we will always work with the Standard Normal curve with mean 0 and SD 1. If the data has some other mean or SD, then you can convert it to Standard Normal by placing the data in Standard Units.
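Areas under the Standard Normal curve are usually read from a table, but they can also be computed. The sketch below is an illustration rather than part of the course material. It uses Python's `math.erf`, relying on the identity that the area between −z and z under the Standard Normal curve equals erf(z/√2), together with the symmetry of regions A and C. The helper names `area_within` and `area_beyond` are invented for this sketch.

```python
import math


def area_within(z):
    """Area under the Standard Normal curve between -z and z (region B)."""
    return math.erf(z / math.sqrt(2))


def area_beyond(z):
    """Area to the right of z (region C); by symmetry, also the area left of -z."""
    return (1 - area_within(z)) / 2


for z in (1, 2, 3):
    print(z, round(area_within(z), 4), round(area_beyond(z), 5))
```

For z = 1 and z = 2 the central areas come out to about 0.6827 and 0.9545, which is what a standard Normal table lists for those rows.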
The Standard Normal curve is displayed in Figure 2. Notice that the Normal distribution is symmetric about its mean and that the density is largest at the mean. Although the density of the Normal curve is greater than zero at all points on the real line, this density is vanishingly small for any value larger than about 3 or 4 in magnitude.

Here are some properties of the Normal distribution. First, it is symmetric about the mean, which implies that the density at any positive point z is equal to that at −z, and likewise the area under the curve to the left of −z (Region A in Figure 2) is equal to the area under the curve to the right of z (Region C). Because it is a distribution, the area under the Normal curve is always equal to 1, so we always have the following equations:

$$A + B + C = 1 \quad \text{and} \quad A = C \tag{8}$$

This means that knowing the area under the curve in Region B (the interval between −z and z) immediately tells us the area in A and C. Ordinarily, we would find the area under the curve using an integral; however, the Normal distribution doesn't have an easy solution to this integral, so we use computers or charts like the one in the back of your book to get approximations of the area. Let's work a few example problems to get comfortable with using the chart:

Example of the Normal Distribution 1: Region B. You have a data set that follows a Normal distribution. What proportion of data points are within 1 standard deviation of the mean?

The Standard Normal distribution is in units of standard deviations, so we are interested in the area under the curve between z = −1 and z = 1. Refer to the chart in the back of your book and find the row with value z = 1. The value under the Area heading is about 0.6827. This means that 68.27% of the data is within 1 standard deviation of the mean.

Example of the Normal Distribution 2: Region C. In intelligence studies, the intelligence quotient (IQ) is often used to compare people's abilities. Suppose IQ follows a Normal distribution with mean 100 and standard deviation 15. What proportion of people have an IQ of 130 or higher?

Let's convert the data to standard units. We are interested in how many people score higher than X = 130, and in standard units, this is given by

$$z = \frac{X - \bar{X}}{\mathrm{SD}(X)} = \frac{130 - 100}{15} = 2.$$

So, in the Standard Normal distribution, we want to find the area under the curve in the region C with z = 2. If we refer to the chart in the back of the book, we find that the area in region B for z = 2 is about 0.9545, so we need to apply Equation 8 to find C:

$$1 = A + B + C = B + 2C \implies C = \frac{1 - B}{2} = \frac{1 - 0.9545}{2} = 0.02275.$$

So about 2.275% of the population has an IQ of 130 or greater.

3.1 Chebyshev's Inequality

Sometimes it's nice to have a baseline figure for what proportion of your data falls in a particular range. When the data follows a Normal distribution, we can use area charts to quantify these proportions with great precision. However, when the data aren't necessarily Normal, we can still use Chebyshev's Inequality to get a bound on the minimum proportion of data falling within specific ranges.

Chebyshev's Inequality: Suppose you have data X.
We are interested in the proportion of data points within k standard deviations of the mean. Chebyshev's Inequality says:

$$\text{Proportion of data in } \left(\bar{X} - k\,\mathrm{SD}(X),\ \bar{X} + k\,\mathrm{SD}(X)\right) \ge 1 - \frac{1}{k^2}. \tag{9}$$

Example of Chebyshev's Inequality: Suppose the SAT math test has a mean score of $\bar{X} = 500$ and a standard deviation of $\mathrm{SD}(X) = 100$. At least what proportion of students score between 350 and 650?

Students scoring 350 and 650 are k = 1.5 standard deviations away from the mean. To see this, you can convert these values to standard units: $k = \frac{650 - 500}{100} = 1.5$. Chebyshev's Inequality says the proportion of students scoring between 1.5 SDs below and above the mean is at least:

$$\text{Proportion of data in } (350, 650) \ge 1 - \frac{1}{k^2} = 1 - \frac{1}{1.5^2} = \frac{5}{9}.$$

4 Correlation and Regression

We are often interested to find out whether there is a relationship between variables X and Y. Suppose you have data $(X_1, Y_1), \ldots, (X_n, Y_n)$. The correlation is a measure of association that indicates the strength and direction of a linear relationship. We can define the correlation as:

$$r = \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\mathrm{SD}(X)\,\mathrm{SD}(Y)} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{SU}(X_i)\,\mathrm{SU}(Y_i) \tag{10}$$

The correlation is always in the interval −1 ≤ r ≤ 1. When r is positive, an increase in X corresponds, on average, to an increase in Y (and likewise an increase in Y corresponds to an increase in X). When r is negative, an increase in X corresponds, on average, to a decrease in Y (and vice-versa). When r = 0, there is no linear association between X and Y. However, because X and Y are random, the value of Y for a specific value of X has a distribution of values it can take. This is true except when r = 1 or r = −1, in which case Y is a linear function of X and so knowing one means you can predict the other exactly.

Example of Correlation: You have data $(X_1, \ldots, X_5) = (1, 3, 4, 5, 7)$ and $(Y_1, \ldots, Y_5) = (11, 9, 13, 10, 7)$. $\bar{X} = 4$, $\bar{Y} = 10$, and $\mathrm{SD}(X) = \mathrm{SD}(Y) = 2$. Find the correlation of X and Y.

Let's convert the data into standard units. Apply the formula $\mathrm{SU}(X_i) = \frac{X_i - \bar{X}}{\mathrm{SD}(X)}$ to get $[\mathrm{SU}(X_1), \ldots, \mathrm{SU}(X_5)] = (-1.5, -0.5, 0, 0.5, 1.5)$, and likewise $[\mathrm{SU}(Y_1), \ldots, \mathrm{SU}(Y_5)] = (0.5, -0.5, 1.5, 0, -1.5)$. Then

$$r = \frac{1}{n}\sum_{i=1}^{n}\mathrm{SU}(X_i)\,\mathrm{SU}(Y_i) = \frac{1}{5}\left(-1.5(0.5) - 0.5(-0.5) + 0(1.5) + 0.5(0) + 1.5(-1.5)\right) = \frac{-2.75}{5} = -0.55.$$

One of the most important lessons you should take away from this class is that correlation doesn't imply causation.

Now we have all the tools we need to turn our attention to prediction, the holy grail of statistics. Prediction is important because it helps us make decisions. In an uncertain world, we have to make decisions based upon the information available to us in a manner that best achieves our goals. When universities decide which applicants to admit, they have to decide what qualities are important and then predict the student's performance in college based upon the information available. One way we can do that is using a regression estimate.

In a regression setting, we collect data $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_n$, where each pair of points $(X_i, Y_i)$ is collected from the same observation. For instance, we might collect SAT scores and first year college GPAs for n university students. Then, using this data, we want to make a general prediction about the population we're studying. There isn't much point in predicting the GPA of the students we collected data on because they've already completed their first year.
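The correlation example earlier in this section (the average product of standard units) can be reproduced with a short Python sketch. It is an illustration only. The helper names are made up, the standard deviation uses the 1/n divisor as in the rest of this handout, and the data are the five (X, Y) pairs from that example.

```python
import math


def standard_units(data):
    """Convert data to standard units using the 1/n standard deviation."""
    n = len(data)
    m = sum(data) / n
    s = math.sqrt(sum((v - m) ** 2 for v in data) / n)
    return [(v - m) / s for v in data]


def correlation(x, y):
    """Average of the products of the standard units of x and y."""
    su_x, su_y = standard_units(x), standard_units(y)
    return sum(p * q for p, q in zip(su_x, su_y)) / len(x)


x = [1, 3, 4, 5, 7]
y = [11, 9, 13, 10, 7]
print(correlation(x, y))  # should be close to -0.55
```

Swapping the roles of `x` and `y` gives the same value, since the formula treats the two variables symmetrically.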
The information we get from these students can still be used to predict first year GPA for next year's applicants. To do this, we'll construct the regression line:

$$\hat{Y} = \bar{Y} + r\,\frac{\mathrm{SD}(Y)}{\mathrm{SD}(X)}(X - \bar{X}) \tag{11}$$

The regression line tells us an average value of Y given a value of X. Of course, since this is a prediction, we don't know what someone's first year GPA will be just from finding out their SAT score. The GPA follows a distribution centered around the predicted value. Let's work an example:

Example: Regression. In a population study of middle-aged men, the average body weight is 180 pounds with a standard deviation of 20, and the average blood pressure is 100 with a standard deviation of 10. The study found a correlation of −0.6 between weight and blood pressure in this population. Suppose a particular middle aged man weighs 200 pounds. Predict his blood pressure.

We are trying to predict blood pressure Y based upon body weight X. We have $\bar{Y} = 100$, $\bar{X} = 180$, $\mathrm{SD}(Y) = 10$, $\mathrm{SD}(X) = 20$, and r = −0.6. Plugging into Equation 11, we predict that a 200 pound middle-aged man will have the following blood pressure:

$$\hat{Y} = \bar{Y} + r\,\frac{\mathrm{SD}(Y)}{\mathrm{SD}(X)}(X - \bar{X}) = 100 - 0.6\left(\frac{10}{20}\right)(200 - 180) = 94.$$

The regression line tells us an average value for Y given a value of X, but we are also interested in the standard deviation of the values Y can take given the value X. This standard deviation is given by the root mean squared error, or RMS error, which is defined as

$$\mathrm{RMS} = \mathrm{SD}(Y \mid X) = \sqrt{1 - r^2}\;\mathrm{SD}(Y) \tag{12}$$

Example: RMS error. Referring to the Regression example above, what is the RMS error for the blood pressure of middle aged men who weigh 200 pounds?

Recall that r = −0.6 and SD(Y) = 10. Then the RMS error is given by:

$$\mathrm{RMS} = \mathrm{SD}(Y \mid X = 200) = \sqrt{1 - r^2}\;\mathrm{SD}(Y) = \sqrt{1 - (-0.6)^2}\,(10) = 0.8 \times 10 = 8.$$

Notice that the RMS error is different than the overall standard deviation for blood pressure SD(Y). We're often interested in predicting a range of values that covers a certain proportion of all possibilities for the person; it's nice to be able to say that there's a 95% chance that your blood pressure will be in a particular range. Since −1 ≤ r ≤ 1, we have 0 ≤ r² ≤ 1, and therefore the RMS error is always within the range $0 \le \sqrt{1 - r^2}\;\mathrm{SD}(Y) \le \mathrm{SD}(Y)$. Since the RMS error is never greater than SD(Y), we can obtain a smaller interval using the RMS error. That's the advantage of using regression.

So now we want to determine proportions of data falling into a specific range. Remember that the regression line gives us the mean value of Y for a particular value of X, and the RMS error gives us the standard deviation for Y values at that value of X. When the scatter plot of the data is football shaped, we can assume that different measurements of Y for a particular value of X will follow a Normal distribution with that mean and standard deviation. Then we can construct intervals containing the desired proportion of data points using the Normal area table in the back of your book.

One of the most important questions to ask yourself in answering regression problems is whether we're talking about an overall population or some sub-population. You can usually determine this from the wording of the question: if we're talking only about values of Y for a specific value of X, then we're using a sub-population. We will refer to this population as $Y \mid X$, which you should think of in words as "Y given X" because you know specific information about the value of X. When this is not the case, it's safe to use the entire population. Once you know which population you're interested in for the problem at hand, you can refer to the chart below to obtain this population's mean and standard deviation.
For the conditional populations $Y \mid X$ and $X \mid Y$, we use a point on the regression line as our estimate of the mean and the RMS error as our estimate of the standard deviation.

| Population | Mean | Standard Deviation |
|---|---|---|
| $X$ | $\bar{X}$ | $SD(X)$ |
| $Y$ | $\bar{Y}$ | $SD(Y)$ |
| $Y \mid X$ | $\bar{Y} + r \frac{SD(Y)}{SD(X)} (X - \bar{X})$ | $\sqrt{1 - r^2} \; SD(Y)$ |
| $X \mid Y$ | $\bar{X} + r \frac{SD(X)}{SD(Y)} (Y - \bar{Y})$ | $\sqrt{1 - r^2} \; SD(X)$ |

Example: Regression Prediction Intervals. Let's continue with the example used in the Regression and RMS Error examples. The scatter plot was found to be football shaped. What proportion of 200-pound middle-aged men will have a blood pressure between 86 and 102?

In the Regression example, we found that the average blood pressure for 200-pound middle-aged men is 94, and in the RMS Error example, we found that the standard deviation of the blood pressure readings of 200-pound middle-aged men is 8. We want to find what proportion of 200-pound middle-aged men will have a blood pressure reading between 86 and 102. Let's convert these values to standard units:

$$\frac{86 - 94}{8} = -1, \qquad \frac{102 - 94}{8} = \frac{8}{8} = 1.$$

Now we must find the area under the normal curve between -1 and 1. We look in the back of your book for the value z = 1, and the corresponding area is 68.27. Therefore, 68.27% of 200-pound middle-aged men will have blood pressure readings between 86 and 102.

These are the sorts of calculations we can do with regression. However, it is not always appropriate to use regression estimation. The fundamental assumption needed is that the scatter plot of the data is approximately football shaped. We can determine whether the regression line is a good fit by comparing the line we obtain to the data. The difference between each data point and the prediction line for that value of X is the prediction error at that point. In order to determine whether regression is appropriate, we need to compare the distribution of the errors across different levels of X. Take a look at Figure 3.
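As a numerical check of the Regression Prediction Intervals example above, the sketch below computes the same proportion with Python's standard-library `statistics.NormalDist` in place of the Normal area table. It assumes, as the example does, that blood pressure among 200-pound men is normal with mean 94 (the regression prediction) and standard deviation 8 (the RMS error). The helper name `proportion_between` is hypothetical.

```python
from statistics import NormalDist

def proportion_between(low, high, mean, sd):
    """Share of a normal(mean, sd) population falling between low and high."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(high) - dist.cdf(low)

# Sub-population Y|X = 200: mean 94 from the regression line, SD 8 from the RMS error.
share = proportion_between(86, 102, mean=94, sd=8)

print(f"{share:.2%}")
```

Because 86 and 102 are exactly one RMS error below and above the predicted value, the result is the familiar area between -1 and 1 in standard units.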
In Figure 3, all the data points lie above the regression line until a little after 1, then below the regression line until about 4, and then above the regression line for all values of X greater than about 4. Because the errors do not have the same distribution for all levels of X, we say that the errors are heteroscedastic. It is not appropriate to use linear regression when the errors are heteroscedastic. Indeed, for Figure 3, I generated the data as random points along the curve $Y = X^3$, a non-linear relationship.

Figure 4 provides an example of when it's appropriate to use linear regression. If we divide the scatter plot of Y versus X into thin, vertical strips, then the error (the difference between the data and the regression line) roughly follows the same distribution in every strip. We call this kind of error homoscedastic because all the errors come from the same process. When the errors are homoscedastic, linear regression is OK.

[Figure 3: Heteroscedastic Errors. Scatter plot of Y against X, titled "Heteroscedastic Errors > Don't Use Regression".]

[Figure 4: Homoscedastic Errors. Scatter plot of Y against X, titled "Homoscedastic Errors > Use Regression".]

5 More Challenging Problems

Normal Curve Area: What is the area under the normal curve between -1 and 2?

Take a look at Figure 2. The chart in the back of your book only provides areas for symmetric regions B between the values -z and z, where z is some positive number. However, we can solve this problem by doing two different calculations. First, let's find the area between z = -2 and z = 2. Using the table, this is 95.45%. Then, let's find the area between z = -1 and z = 1. This is 68.27%. So the question is how to find the area between -2 and -1. For this, we use the symmetry of the Normal curve. There is (95.45 - 68.27)% = 27.18% of the area that is between -2 and 2 but not between -1 and 1. Since the area in [-2, -1] is the same as the area in [1, 2], we divide this difference by 2 to find the area in [-2, -1]. This gives us 27.18%/2 = 13.59% between -2 and -1. Therefore, the area between -1 and 2 is:

$$95.45\% - 13.59\% = 81.86\%$$

Example: Regression with Percentiles. This is taken from Question 9 on page 78 of the Freedman/Pisani/Purves text. In a large statistics class, the correlation between the midterm scores and final scores is found to be nearly 0.5, every term. The scatter diagrams are football shaped. Predict the percentile rank on the final for a student whose percentile rank on the midterm is 5%.

This problem is particularly challenging because ordinarily we need means and standard deviations to do regression. However, using the Normal distribution, we can recover all the information we need just from the percentiles. Remember that a percentile corresponds to the proportion of students scoring less than a particular value. If the percentile is less than 50, this corresponds to the area of A on the normal curve of Figure 2.
(Otherwise you would need to include areas A and B.) The 5th percentile of the Standard Normal distribution is found by remembering that Areas A and C both have 5% in them, so Area B has 90% within it. Then we find the value of 90% in the Area column of the chart in the back of our book, and we pick out the value of z corresponding to it. However, since we're working on the left side of the curve, we'll actually take -z. The closest value to 90% in the Area column of the table corresponds to z = 1.65, so we take -1.65 to be the student's midterm score in standard units. Let's refer to that as X = -1.65.

Now we want to do a regression prediction for the student's score on the final in standard units. For this, we use the regression line formula given by Equation 1:

$$Y = \bar{Y} + r \, \frac{SD(Y)}{SD(X)} \, (X - \bar{X}) = rX = 0.5(-1.65) = -0.825$$

The second equality reduces the work considerably. We can only do this because both X (the midterm exam score) and Y (the predicted final exam score) are in standard units. Therefore, the means are 0 and the SDs are 1, leaving us only with rX.

The next step is to convert the predicted final score Y to a percentile using the Normal curve. We now have z = 0.825, so we look up the area between -0.825 and 0.825 using the table in the back of the book. The closest table entries are 0.8 and 0.85. Let's use 0.85 just to get a conservative estimate. Then the area B is 60.47%, so the area A is:

$$A = \frac{1 - B}{2} = \frac{1 - 0.6047}{2} = 0.19765 = 19.765\%.$$

This is the student's predicted final exam percentile. In actuality, it's closer to 20.47% if we use a computer to give the area between -0.825 and 0.825 on the Normal curve.

Just so we're clear, here's a summary procedure you can use for these percentile regression problems:

1. Convert the student's midterm percentile to a score in standard units by finding the point on the normal curve with that percentile area to the left of it.
2. Predict the student's final exam score in standard units by applying the regression line equation.
3. Convert the student's predicted score to a percentile by finding the area to the left of that point.

Example: Specifying a Normal Distribution with Percentiles. This is taken from Question on
page 06 of the Freedman/Pisani/Purves text. For a certain group of women, the 25th percentile of height is 62.2 inches, and the 75th percentile is 65.8 inches. The histogram follows the normal curve. Find the 90th percentile of the height distribution.

The first thing you should remember here is that the mean and standard deviation of the height distribution are unknown. In order to find the 90th percentile, we will need to determine these quantities first. In this case, we can rely on the symmetry of the normal curve to find the mean. Because we know the 25th and 75th percentiles, and because these points are equidistant from the mean, we can find the mean of the normal distribution by just averaging the two points we know:

$$\bar{X} = \frac{62.2 + 65.8}{2} = 64.$$

Now that we know the mean, a natural question to ask is how many standard deviations from the mean the other known points are. Let's work with 65.8, the 75th percentile. What can we say about this value? Let's use the following equation:

$$65.8 = \bar{X} + z \, SD(X)$$

That is, the 75th percentile is z SDs above the mean. We don't yet know z or $SD(X)$, but we can figure out z from the fact that 50% of the data lies between 62.2 and 65.8. Now we just have to find, in standard units, the value of z corresponding to having 50% of the area between -z and z on the Standard Normal Curve. Using the table in the back of our books, we search for the Area closest to 50%. The value z = 0.7 has Area 51.61%, and the value z = 0.65 has Area 48.43%. These areas are almost equidistant from 50%, so I will choose to average the two z values. (However, we won't mind if you just use the closest point on the table.) Therefore, I will select z = 0.675 and say that the 75th percentile is roughly 0.675 standard deviations above the mean. Returning to the above equation, we now have:

$$65.8 = \bar{X} + z \, SD(X) = 64 + 0.675 \, SD(X) \quad \Rightarrow \quad SD(X) = \frac{65.8 - 64}{0.675} = \frac{8}{3}.$$

Now that we know $\bar{X}$ and $SD(X)$, we just have to find the 90th percentile. Let's call this value X. Putting X in standard units, we have:

$$Z = \frac{X - \bar{X}}{SD(X)} \quad \Rightarrow \quad X = Z \, SD(X) + \bar{X}$$

Because we know $\bar{X}$ and $SD(X)$, we just have to find the value of Z corresponding to the 90th percentile on the Standard Normal Curve. The 90th percentile is the point such that 10% of the data are to the right, which means 90% is to the left. Of this 90%, 10% is in the left tail and the remaining 80% is in the middle region B. We then have to find the value of Z corresponding to an area of 80% in the chart. The closest value is Z = 1.3. We can now return to the above equation:

$$X = Z \, SD(X) + \bar{X} = 1.3 \left( \frac{8}{3} \right) + 64 = 67.467 \text{ inches.}$$

This is the 90th percentile of height.
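The three problems in this section can be cross-checked with a short Python sketch. It is only an illustration, not the method expected on the exam: it swaps the Normal area table for the standard-library `statistics.NormalDist`, so the results differ slightly from the table-based answers because no rounding to the nearest table entry takes place. The function names `predict_final_percentile` and `percentile_from_quartiles` are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # Standard Normal curve: mean 0, SD 1

# Normal Curve Area: area between -1 and 2 in standard units.
area_minus1_to_2 = std_normal.cdf(2) - std_normal.cdf(-1)

def predict_final_percentile(midterm_percentile, r):
    """Three-step procedure: percentile -> standard units -> regression -> percentile."""
    z_midterm = std_normal.inv_cdf(midterm_percentile / 100)
    z_final = r * z_midterm  # regression line in standard units: Y = rX
    return 100 * std_normal.cdf(z_final)

def percentile_from_quartiles(q25, q75, target_percentile):
    """Recover the mean and SD from the quartiles, then find the target percentile."""
    mean = (q25 + q75) / 2                      # symmetry of the normal curve
    sd = (q75 - mean) / std_normal.inv_cdf(0.75)  # 75th percentile is z SDs above the mean
    return mean + std_normal.inv_cdf(target_percentile / 100) * sd

final_percentile = predict_final_percentile(5, r=0.5)
height_90th = percentile_from_quartiles(62.2, 65.8, 90)

print(f"area between -1 and 2: {area_minus1_to_2:.2%}")
print(f"predicted final percentile: {final_percentile:.1f}")
print(f"90th percentile of height: {height_90th:.2f} inches")
```

With exact normal areas, the predicted final percentile comes out at roughly 20.5 and the 90th percentile of height at roughly 67.4 inches, both close to the table-based answers above.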