University of California, Berkeley, Statistics 131A: Statistical Inference for the Social and Life Sciences. Michael Lugo, Spring 2012

Solutions to Exam 1, Friday, March 2, 2012

1. [5: 2+2+1] Consider the stemplot below.

0 | 5 7 9
1 | 1 2 3 6
2 | 4 8
3 | 3
4 |
5 | 5
6 | 0
7 | 8

(a) What is the median of the data represented in this stemplot?

There are 13 data points; the median is the (13 + 1)/2th largest, or 7th largest, which is 16.

(b) The mean of the data represented in this stemplot is (circle one): much smaller than / about equal to / much larger than the median. Explain your answer without any explicit computations.

Since the distribution is right-skewed, the mean is much larger than the median.

(c) One of the images below is a boxplot for the data given in the stemplot. Circle that boxplot. No explanation is necessary.

The bottom of the three boxplots is the correct one. There is a right outlier (corresponding to 78 in the data) and the boxplot is otherwise typical of a right-skewed distribution.
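As a quick check of parts (a) and (b), the sketch below lists the thirteen values as read from the stemplot and compares the median with the mean. The reading of the plot (stem = tens digit, leaf = ones digit) is an assumption about the original figure, and the variable names are illustrative only.

```python
from statistics import mean, median

# Values as read from the stemplot (assumed reading: stem = tens, leaf = ones).
data = [5, 7, 9, 11, 12, 13, 16, 24, 28, 33, 55, 60, 78]

n = len(data)
med = median(data)  # middle value: the (n + 1)/2-th in sorted order when n is odd
avg = mean(data)    # pulled upward by the long right tail

print(f"n = {n}, median = {med}, mean = {avg:.1f}")
```

Part (b) asks for an argument without computation; the numbers here only confirm that the right skew pulls the mean well above the median.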

2. [6: 3+3] Below is a histogram for a data set containing nine numbers.

[Figure: "Histogram of x"; horizontal axis x from 0 to 8, vertical axis Frequency from 0.0 to 2.0.]

(a) Can you determine the median of this data set exactly? If you can, do so. If not, explain why not and give the best possible bounds on the median. (For example, the median is clearly between 1 and 8. But you can do better.)

There are nine data points, so the median is the fifth smallest. One is between 1 and 2, one between 2 and 3, and two between 3 and 4. The fifth smallest data point is between 4 and 5, but we can't say what it is more precisely than that.

(b) Can you determine the mean of this data set exactly? If you can, do so. If not, explain why not and give the best possible bounds on the mean.

The smallest data point is between 1 and 2; the second smallest is between 2 and 3; and so on. So the mean is at least (1 + 2 + 3 + 3 + 4 + 4 + 5 + 6 + 7)/9 = 35/9 ≈ 3.89 and at most 1 more than this, or 44/9 ≈ 4.89.
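The bounding argument can be sketched in a few lines. The bin counts below are an assumption inferred from the solution text (unit-width bins from 1 to 8, with the counts implied by the list of left edges in part (b)); the original histogram is not reproduced here.

```python
# (left edge, right edge, count) for each bin; counts are inferred from the
# solution text, not read from the original figure.
bins = [(1, 2, 1), (2, 3, 1), (3, 4, 2), (4, 5, 2), (5, 6, 1), (6, 7, 1), (7, 8, 1)]

n = sum(count for _, _, count in bins)

# Mean bounds: push every point to the left (or right) edge of its bin.
mean_low = sum(lo * count for lo, _, count in bins) / n
mean_high = sum(hi * count for _, hi, count in bins) / n

# Median bounds: locate the bin that holds the middle observation.
middle = (n + 1) // 2
seen = 0
for lo, hi, count in bins:
    seen += count
    if seen >= middle:
        median_bounds = (lo, hi)
        break

print(f"n = {n}, median in {median_bounds}, "
      f"mean between {mean_low:.2f} and {mean_high:.2f}")
```

Summing the left edges listed in part (b) gives 35, so under these assumed counts the lower bound on the mean is 35/9 and the upper bound is exactly 1 higher, because every bin has width 1.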

3. [8: 2+2+2+2] Consider the data set of four points given below:

x | 1 2 2 3
y | 2 2 4 4

(a) Find the standard deviation $s_x$.

The mean is (1 + 2 + 2 + 3)/4 = 2; the standard deviation is

$$s_x = \sqrt{\frac{1}{4-1}\left((1-2)^2 + (2-2)^2 + (2-2)^2 + (3-2)^2\right)} = \sqrt{\frac{2}{3}}.$$

(b) Find the standard deviation $s_y$.

The mean is (2 + 2 + 4 + 4)/4 = 3; the standard deviation is

$$s_y = \sqrt{\frac{1}{4-1}\left((3-2)^2 + (3-2)^2 + (3-4)^2 + (3-4)^2\right)} = \sqrt{\frac{4}{3}}.$$

(c) Find the coefficient of correlation r.

We have the formula

$$r = \frac{1}{n-1}\sum_{i=1}^{n} \frac{x_i - \bar{x}}{s_x} \cdot \frac{y_i - \bar{y}}{s_y},$$

and so

$$r = \frac{1}{(4-1)\sqrt{2/3}\,\sqrt{4/3}} \sum_{i=1}^{n} (x_i - 2)(y_i - 3).$$

Now, plugging in values,

$$r = \frac{1}{\sqrt{8}}\left[(1-2)(2-3) + (2-2)(2-3) + (2-2)(4-3) + (3-2)(4-3)\right] = \frac{2}{\sqrt{8}} = \frac{1}{\sqrt{2}}.$$

(d) What is the equation of the regression line for predicting y from x?

The regression line passes through $(\bar{x}, \bar{y}) = (2, 3)$ and has slope $r s_y / s_x = 1$. Thus its equation is y = x + 1.
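The hand computation can be cross-checked numerically. This is a minimal sketch that takes the data as x = 1, 2, 2, 3 and y = 2, 2, 4, 4 and uses the n - 1 convention for the standard deviations; the variable names are illustrative.

```python
from math import sqrt

x = [1, 2, 2, 3]
y = [2, 2, 4, 4]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample standard deviations (divide by n - 1).
s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# Correlation: sum of products of deviations, scaled by (n - 1) * s_x * s_y.
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

# Regression of y on x: slope r * s_y / s_x, line through (x_bar, y_bar).
slope = r * s_y / s_x
intercept = y_bar - slope * x_bar

print(f"s_x = {s_x:.4f}, s_y = {s_y:.4f}, r = {r:.4f}")
print(f"regression line: y = {slope:.2f} x + {intercept:.2f}")
```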

4. [5: 2+2+1] Scores on the math section of the SAT are normally distributed with mean 500 and standard deviation 100.

(a) What proportion of math SAT scores are between 610 and 680?

Standardizing gives z = 1.1, z = 1.8. So we want Φ(1.8) - Φ(1.1) = 0.9641 - 0.8643 = 0.0998.

(b) What score is at the 80th percentile of math SAT scores?

From the normal table, $\Phi^{-1}(0.8) = 0.84$. Unstandardizing gives 500 + (100)(0.84) = 584.

(c) The proportion of students scoring less than 350 is (circle one):

greater than the number scoring at least 630
between the number scoring at least 630 and the number scoring at least 670
less than the number scoring at least 670

The normal distribution is symmetric around its mean, so the number scoring less than 350 (=500-150) is the same as the number scoring greater than 650 (=500+150). Since 630 < 650 < 670, this is between the number scoring at least 630 and the number scoring at least 670.
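The table lookups can be reproduced with Python's standard-library `statistics.NormalDist`. This is only a sketch: the exact values differ slightly from the table-based answers because the table rounds z and Φ to a few decimal places.

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=100)

# (a) proportion of scores between 610 and 680
p_a = sat.cdf(680) - sat.cdf(610)

# (b) score at the 80th percentile
score_80 = sat.inv_cdf(0.80)

# (c) tail proportions to compare
below_350 = sat.cdf(350)
at_least_630 = 1 - sat.cdf(630)
at_least_670 = 1 - sat.cdf(670)

print(f"(a) {p_a:.4f}")
print(f"(b) {score_80:.1f}")
print(f"(c) {at_least_670:.4f} < {below_350:.4f} < {at_least_630:.4f}")
```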

5. [3] Let $r_M$ be the coefficient of correlation between the heights and weights of adult men. Let $r_A$ be the coefficient of correlation between the heights and weights of all adults. Which of the following is true? Circle one.

$r_M < r_A$    $r_M = r_A$    $r_M > r_A$

Explain your answer, using a clearly labeled diagram and/or a few sentences of text.

This was intended to be a problem about the restricted range effect. If we know someone's height, then knowing their gender doesn't give us much additional information for predicting their weight, so the residuals in the men-only case and in the all-adults case are similar. But the variance of the weights of all adults is much larger than the variance of the weights of men. We recall that $1 - r^2$ is the variance of the residuals divided by the variance of the response variable. This quotient has a larger denominator for all adults, so $1 - r_M^2 > 1 - r_A^2$; rearranging (and assuming correlations are positive) gives $r_M < r_A$.

However, it turns out that there is significant overlap between the distribution of heights and weights of men and that of women, so this doesn't really happen. In fact, from actual data, $r_M > r_A$. If you put this, see us and we'll give you back a point.

6. [3] In one study, it was necessary to draw a representative sample of Japanese-Americans resident in San Francisco. The procedure was as follows. After consultation with representative figures in the Japanese community, the four most representative blocks in the Japanese area of the city were chosen. All persons resident in those four blocks were taken for the sample. However, a comparison with Census data shows that the sample did not include a high enough proportion of Japanese with college degrees. How can this be explained?

People living within the Japanese community are likely to be less well assimilated to American culture (and perhaps more likely to not be fluent in English). As a result, a sample which overrepresents this community will have a lower proportion of people with college degrees.

7. [6: 3 + 1 + 2] The figure below shows a scatter diagram of the high temperatures in San Francisco (SFO) and Los Angeles (LAX) for each day in 2011.

[Figure: scatter diagram "LAX temps vs. SFO temps"; axes are high temperature at SFO and high temperature at LAX, each running from 50 to 90; three lines are drawn, labeled Q, R, and S.]

(a) Three lines are drawn, and are labeled Q, R, and S. For each description circle the letter of the line it corresponds to.

(i) Estimated average high at LAX, for a given high at SFO: Q R S
(ii) Estimated average high at SFO, for a given high at LAX: Q R S
(iii) Nearly equal percentile ranks in both data sets: Q R S

(b) The coefficient of correlation for these 365 points is closest to (circle one): -1, -0.5, 0, 0.5, 1

(c) The average of the 31 high temperatures for January 2011 at SFO was 56.7 degrees; the average of the 31 high temperatures for January 2011 at LAX was 68.7 degrees. This gives us the point (56.7, 68.7). We could compute similar points for the other eleven months of the year, and compute the correlation coefficient of these twelve points. The coefficient of correlation of the twelve monthly averages is (circle one) less than / equal to / greater than the coefficient of correlation of the original 365 daily data points. Briefly explain your answer.

This is an ecological correlation; the averaging removes the day-to-day fluctuation and just leaves the seasonal trend, namely that both places are cool in the winter and warm in the summer. The correlation of the monthly averages is therefore greater than that of the daily data.
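The effect in part (c) can be illustrated with a small simulation. The data below are synthetic, not the actual SFO and LAX temperatures: two hypothetical stations share a seasonal cycle and each has independent day-to-day noise. The amplitudes, noise levels, and names (`station_a`, `station_b`, `pearson`) are assumptions chosen for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(0)  # fixed seed so the run is reproducible
month_lengths = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

# Synthetic daily highs: shared seasonal cycle plus independent daily noise.
station_a, station_b, month_of_day = [], [], []
day = 0
for month, length in enumerate(month_lengths):
    for _ in range(length):
        season = -math.cos(2 * math.pi * day / 365)  # coolest near day 0
        station_a.append(65 + 8 * season + random.gauss(0, 6))
        station_b.append(72 + 8 * season + random.gauss(0, 6))
        month_of_day.append(month)
        day += 1


def monthly_means(values):
    """Average the daily values within each of the twelve months."""
    totals = [0.0] * 12
    for m, v in zip(month_of_day, values):
        totals[m] += v
    return [t / length for t, length in zip(totals, month_lengths)]


r_daily = pearson(station_a, station_b)
r_monthly = pearson(monthly_means(station_a), monthly_means(station_b))

print(f"daily r = {r_daily:.2f}, monthly r = {r_monthly:.2f}")
```

Averaging within each month shrinks the independent daily noise by roughly a factor of 30 in variance while leaving the shared seasonal signal nearly intact, which is why the twelve monthly points line up much more tightly than the 365 daily ones.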