STAT FINAL EXAM


STAT101 2013 FINAL EXAM

This exam is 2 hours long. It is closed book but you can use an A-4 size cheat sheet. There are 10 questions. Questions are not of equal weight. You may need a calculator for some of the questions but you MUST show your work. For questions that require a one or two line explanation, you will receive no credit if you write three or more lines; being concise matters. If you encounter a question that you do not know how to do, you may receive partial credit if you tell me in plain English (complemented with appropriate probability/statistical symbols if you like) how you would approach the problem. You may use pen or pencil; write legibly.

ALL ANSWERS MUST BE ENTERED ON THIS EXAM PAPER. THE BOOKLET IS FOR ROUGH WORK AND WILL NOT BE GRADED OR COLLECTED.

Name:
IC:

Question   Points
1          7
2          7
3          5
4          6
5          8
6          7
7          7
8          15
9          20
10         18
Total      100

2013-14 Term 1

1. (7 points) Scores in a test follow a normal distribution. The scores have an average of 20.4 and a standard deviation of 3.1.

(a) What proportion of the students has a score of over 22? (2 points)

$$P(X > 22) = P\left(\frac{X - 20.4}{3.1} > \frac{22 - 20.4}{3.1}\right) = P(Z > 0.516) \approx 0.3015.$$

(b) A student has a score of 18. With the aid of a Z-score, comment on his score relative to others who took the test. (2 points)

The student has a Z-score of $(18 - 20.4)/3.1 \approx -0.77$. Hence the student's score is 0.77 standard deviations below average.

(c) Students with a score in the highest 4 percent receive a prize. What is the minimum score (to the nearest integer) to receive a prize? (3 points)

The highest 4 percent starts at the 96th percentile. Let $a$ be the 96th percentile. Then

$$P(X \ge a) = 0.04 \;\Longrightarrow\; P\left(\frac{X - 20.4}{3.1} \ge \frac{a - 20.4}{3.1}\right) = 0.04 \;\Longrightarrow\; P(Z \ge z^*) = 0.04 \;\Longrightarrow\; z^* \approx 1.76,$$

$$a \approx 20.4 + (1.76)(3.1) = 25.86 \approx 26.$$
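The three parts of Question 1 can be checked numerically. The sketch below uses Python's standard-library `statistics.NormalDist`; the variable names are illustrative and not part of the exam. Because it evaluates the normal CDF exactly instead of rounding z to two decimals for a table lookup, the results differ slightly from the table-based figures above.

```python
from statistics import NormalDist

# Test scores: normal with mean 20.4 and standard deviation 3.1
scores = NormalDist(mu=20.4, sigma=3.1)

# (a) proportion of students scoring above 22
p_over_22 = 1 - scores.cdf(22)

# (b) z-score of a student who scored 18
z_18 = (18 - 20.4) / 3.1

# (c) score at the 96th percentile, the cutoff for the top 4 percent
cutoff = scores.inv_cdf(0.96)

print(f"P(X > 22) = {p_over_22:.4f}")
print(f"z-score for 18 = {z_18:.2f}")
print(f"96th percentile = {cutoff:.2f}")
```

Rounding the cutoff up to the nearest integer gives the same minimum prize score as the table-based calculation.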

2. (7 points) According to the American Red Cross, the following are the percentages of Americans with each blood type:

Blood Type     Percentage
O positive     36
O negative     6
A positive     38
A negative     6
B positive     8
B negative     2
AB positive    3
AB negative    1

(a) What is the chance that two randomly chosen Americans are both of AB blood type, assuming they are not related? (2 points)

Let $W_i$ be the event that person $i$, $i = 1, 2$, has AB blood type. Then $P(W_i) = 0.03 + 0.01 = 0.04$. Since they are not related, and hence independent,

$$P(W_1 \cap W_2) = P(W_1)P(W_2) = 0.04 \times 0.04 = 0.0016.$$

(b) What is the chance that in two randomly chosen Americans, at least one of them has AB negative blood type, assuming they are not related? (2 points)

Let $U_i$ be the event that person $i$, $i = 1, 2$, has AB negative blood type. Note that $U_1, U_2$ are independent because we assume the chosen individuals are independent. Then $P(U_i) = 0.01$. Hence,

$$P(U_1 \cup U_2) = P(U_1) + P(U_2) - P(U_1 \cap U_2) = P(U_1) + P(U_2) - P(U_1)P(U_2) = 0.01 + 0.01 - (0.01)(0.01) = 0.0199.$$

(c) Write an expression for the probability that in 30 randomly chosen, unrelated Americans, between 20 and 22 are of blood type O positive. YOU CANNOT LEAVE ANY UNSPECIFIED SYMBOLS. THERE IS NO NEED TO EVALUATE YOUR EXPRESSION. (3 points)

Let $X$ be the number of O positive individuals in 30 Americans. Then $X \sim \text{Bin}(30, p = 0.36)$. Hence

$$P(20 \le X \le 22) = \frac{30!}{10!\,20!}(0.36)^{20}(0.64)^{10} + \frac{30!}{9!\,21!}(0.36)^{21}(0.64)^{9} + \frac{30!}{8!\,22!}(0.36)^{22}(0.64)^{8}.$$
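Although the exam does not ask for the binomial expression in (c) to be evaluated, a short sketch can confirm all three parts. It uses only `math.comb` from the standard library; `binom_pmf` is a hypothetical helper name introduced here for illustration.

```python
from math import comb

# (a) both of two unrelated people have AB blood (positive or negative)
p_ab = 0.03 + 0.01
both_ab = p_ab ** 2

# (b) at least one of two unrelated people is AB negative,
#     computed through the complement "neither is AB negative"
p_ab_neg = 0.01
at_least_one = 1 - (1 - p_ab_neg) ** 2

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# (c) P(20 <= X <= 22) for X ~ Bin(30, 0.36)
p_20_to_22 = sum(binom_pmf(k, 30, 0.36) for k in (20, 21, 22))

print(both_ab, at_least_one, p_20_to_22)
```

The complement form in (b) is algebraically the same as the inclusion-exclusion formula used in the solution.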

3. (5 points) The PDF and CDF of the per capita income (in thousand dollars) in two populations, A and B, are given below:

[Figure: PDF and CDF of per capita income x for populations A and B, with x = 1 and x = 1.8 marked on the horizontal axes and the values 0.59, 0.87, 0.2 and 0.54 marked on the vertical axis.]

(a) Which population has a higher average per capita income? In which population is it more likely to find an individual with income higher than the average? Explain briefly. (2 points)

Population A has both a higher average income and a higher chance of finding an individual with high income because the PDF of A has the bulk of its probability on the right and the area under the curve for higher income is larger.

(b) In population A, among individuals with income above 1,000 dollars, what proportion earns more than 1,800 dollars? (3 points)

$$P(X > 1.8 \mid X > 1) = \frac{P(X > 1.8,\ X > 1)}{P(X > 1)} = \frac{P(X > 1.8)}{P(X > 1)} = \frac{1 - 0.87}{1 - 0.59} \approx 0.317.$$

4. (6 points) Suppose X, Y have the following joint PDF:

          Y = 1    Y = 2    Y = 3    Y = 4
X = 10    0.2      0.2      0.11     0.09
X = 20    ?        0.1      0.09     0.11

(a) Find $P(Y \ge 3)$. (2 points)

First of all, we need to work out the missing entry. Since the probabilities in a PDF must add to one, we deduce that the missing entry must be 0.1, from which we can work out the marginal probabilities as well.

          Y = 1    Y = 2    Y = 3    Y = 4    Total
X = 10    0.2      0.2      0.11     0.09     0.6
X = 20    0.1      0.1      0.09     0.11     0.4
Total     0.3      0.3      0.2      0.2      1

$$P(Y \ge 3) = P(Y = 3) + P(Y = 4) = 0.2 + 0.2 = 0.4.$$

(b) Find $P(Y < 3 \mid X = 10)$. (2 points)

$$P(Y < 3 \mid X = 10) = \frac{P(Y < 3,\ X = 10)}{P(X = 10)} = \frac{P(Y = 1,\ X = 10) + P(Y = 2,\ X = 10)}{P(X = 10)} = \frac{0.2 + 0.2}{0.6} = \frac{2}{3} \approx 0.667.$$

(c) Are X and Y independent? Justify. (2 points)

Since $P(Y \ge 3) = 0.4 \ne P(Y \ge 3 \mid X = 10) = 1 - 0.667$, X and Y are not independent.
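The table manipulations in Question 4 can be sketched with a dictionary keyed by (x, y) pairs. The helper `prob` and the variable names are illustrative choices, not part of the exam.

```python
# Joint probabilities from the table, keyed by (x, y); the (20, 1) cell is missing
joint = {
    (10, 1): 0.20, (10, 2): 0.20, (10, 3): 0.11, (10, 4): 0.09,
    (20, 2): 0.10, (20, 3): 0.09, (20, 4): 0.11,
}

# The probabilities must add to one, which pins down the missing cell
joint[(20, 1)] = round(1 - sum(joint.values()), 10)

def prob(event):
    """Sum the joint probabilities over all (x, y) cells where event(x, y) holds."""
    return sum(p for (x, y), p in joint.items() if event(x, y))

# (a) marginal probability that Y >= 3
p_y_ge_3 = prob(lambda x, y: y >= 3)

# (b) conditional probability that Y < 3 given X = 10
p_x_10 = prob(lambda x, y: x == 10)
p_y_lt_3_given_x_10 = prob(lambda x, y: x == 10 and y < 3) / p_x_10

# (c) under independence, conditioning on X = 10 would not change P(Y >= 3)
p_y_ge_3_given_x_10 = prob(lambda x, y: x == 10 and y >= 3) / p_x_10
independent = abs(p_y_ge_3 - p_y_ge_3_given_x_10) < 1e-9

print(joint[(20, 1)], p_y_ge_3, p_y_lt_3_given_x_10, independent)
```

A single mismatch between a conditional and a marginal probability is enough to rule out independence, which is the argument used in (c).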

5. (8 points) Angelina has invited 3 friends for a party. However, her friends are not very reliable. Let W be the number of friends that will show up. Furthermore, W has the following PDF:

W       0      1      2      3
P(W)    0.2    0.3    0.1    0.4

(a) Find E(W). (2 points)

$$E(W) = (0)(0.2) + (1)(0.3) + (2)(0.1) + (3)(0.4) = 1.7.$$

(b) Angelina has made a cake that can be evenly divided among everyone at the party. Find the expected size of each person's share, as a proportion of the cake. (3 points)

The total number of individuals at the party is $W + 1$ and each person has $1/(W + 1)$.

$$E\left(\frac{1}{W + 1}\right) = (1/1)(0.2) + (1/2)(0.3) + (1/3)(0.1) + (1/4)(0.4) \approx 0.48.$$

(c) The party is set to start at 3 pm. By 3:05 pm, one of the friends has arrived. What is the new expected number of friends at the party? (3 points)

Let Z be the total number of friends at the party. The distribution of Z is given by $P(W \mid W \ge 1)$, which is the following:

Z       1          2          3
P(Z)    0.3/0.8    0.1/0.8    0.4/0.8

Hence

$$E(Z) = (1)\left(\tfrac{3}{8}\right) + (2)\left(\tfrac{1}{8}\right) + (3)\left(\tfrac{4}{8}\right) = \frac{17}{8}.$$
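The expectations in Question 5 can be reproduced with exact fractions, which avoids rounding and makes the answer 17/8 appear directly. This is a minimal sketch using the standard-library `fractions` module; the names `pmf_w` and `pmf_z` are illustrative.

```python
from fractions import Fraction as F

# Distribution of W, the number of friends who show up
pmf_w = {0: F(2, 10), 1: F(3, 10), 2: F(1, 10), 3: F(4, 10)}

# (a) expected number of friends
e_w = sum(w * p for w, p in pmf_w.items())

# (b) expected share of the cake: W friends plus Angelina split it evenly
e_share = sum(F(1, w + 1) * p for w, p in pmf_w.items())

# (c) condition on at least one friend having arrived, then renormalise
p_at_least_one = sum(p for w, p in pmf_w.items() if w >= 1)
pmf_z = {w: p / p_at_least_one for w, p in pmf_w.items() if w >= 1}
e_z = sum(z * p for z, p in pmf_z.items())

print(e_w, e_share, float(e_share), e_z)
```

Note that the expected share is not 1/(E(W) + 1); the expectation of a nonlinear function has to be taken term by term, as in the solution.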

6. (7 points) Suppose the inter-arrival time S (in days) for a type of event has an exponential distribution with $\lambda = 1.3$, where $1/\lambda$ is the average inter-arrival time.

(a) Find $P(S > 2)$. (2 points)

$$P(S > 2) = 1 - F(2) = 1 - (1 - e^{-2\lambda}) = e^{-2.6} \approx 0.074.$$

(b) Find $P(4 < S < 5 \mid S > 2)$. (2 points)

By the memoryless property of the exponential distribution,

$$P(4 < S < 5 \mid S > 2) = P(2 + 2 < S < 2 + 3 \mid S > 2) = P(2 < S < 3) = e^{-2.6} - e^{-3.9} \approx 0.054.$$

(c) Find $P(X = 10)$, where X is the number of events in a week. (3 points)

Due to the relationship between a Poisson and an exponential distribution, if Y is the number of events in a day, then $Y \sim \text{Poisson}(\lambda = 1.3)$. We are interested in the number of events in one week. Assuming the rate is constant over time, $X \sim \text{Poisson}(\lambda = 1.3 \times 7 = 9.1)$, so

$$P(X = 10) = \frac{9.1^{10} e^{-9.1}}{10!} \approx 0.12.$$
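A short numerical check of Question 6 follows. Part (b) is computed directly from the definition of conditional probability, without invoking memorylessness, so it doubles as a check on that shortcut. The function name `survival` is an illustrative choice.

```python
from math import exp, factorial

rate = 1.3  # events per day

def survival(t):
    """P(S > t) for an exponential inter-arrival time with the given rate."""
    return exp(-rate * t)

# (a) P(S > 2)
p_a = survival(2)

# (b) P(4 < S < 5 | S > 2), straight from the definition of conditional probability
p_b = (survival(4) - survival(5)) / survival(2)

# (c) the weekly count is Poisson with mean 7 * rate
weekly_mean = 7 * rate
p_c = weekly_mean ** 10 * exp(-weekly_mean) / factorial(10)

print(f"{p_a:.3f} {p_b:.3f} {p_c:.2f}")
```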

7. (7 points) A new café has been opened across the street from campus. Every day, a professor sends her assistant to buy a cup of flat-white. There are two servers in the café, Nero and Grigio. The waiting time (in minutes) for a flat-white follows a Uniform(a, b) distribution. When Nero serves, a = 1, b = 4 and when Grigio serves, a = 2, b = 5. There is a 1/4 chance that a customer will be served by Nero.

(a) What is the probability that the assistant has to wait for more than 3 minutes for a flat-white? (2 points)

Let T be the time to be served, let N be the event that Nero serves and let $N^C$ be the event that Grigio serves.

$$P(T > 3) = P(T > 3 \mid N)P(N) + P(T > 3 \mid N^C)P(N^C) = \left(1 - \frac{3 - 1}{4 - 1}\right)(1/4) + \left(1 - \frac{3 - 2}{5 - 2}\right)(3/4) = \frac{7}{12}.$$

(b) If the assistant is served within 3 minutes, find the chance he is served by Grigio. (3 points)

$$P(N^C \mid T \le 3) = \frac{P(T \le 3,\ N^C)}{P(T \le 3)} = \frac{P(T \le 3 \mid N^C)P(N^C)}{P(T \le 3)} = \frac{\left(\frac{3 - 2}{5 - 2}\right)\left(\frac{3}{4}\right)}{1 - \frac{7}{12}} = \frac{3}{5}.$$

(c) Compare (b) to the prior probabilities and comment on whether the answer in (b) is reasonable. (2 points)

Yes. Grigio takes longer on average to serve (3.5 minutes versus 2.5 minutes for Nero), and the observed time of at most 3 minutes is below Grigio's average, so the data favor Nero. This reduces the probability of Grigio from a prior of 3/4 to 3/5.
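The total-probability and Bayes steps in Question 7 can be sketched as follows. The dictionaries `prior` and `limits` and the helper `uniform_cdf` are hypothetical names introduced for this illustration.

```python
def uniform_cdf(t, a, b):
    """P(T <= t) for T ~ Uniform(a, b)."""
    if t <= a:
        return 0.0
    if t >= b:
        return 1.0
    return (t - a) / (b - a)

prior = {"Nero": 0.25, "Grigio": 0.75}        # chance of being served by each
limits = {"Nero": (1, 4), "Grigio": (2, 5)}   # Uniform(a, b) waiting times

# Law of total probability: P(T <= 3) averaged over the two servers
p_within_3 = sum(prior[s] * uniform_cdf(3, *limits[s]) for s in prior)

# (a) probability of waiting more than 3 minutes
p_more_than_3 = 1 - p_within_3

# (b) Bayes' rule: posterior probability of Grigio given T <= 3
posterior_grigio = prior["Grigio"] * uniform_cdf(3, *limits["Grigio"]) / p_within_3

print(p_more_than_3, posterior_grigio)
```

The posterior for Grigio falls below the prior of 3/4, matching the comment in (c).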

8. (15 points) A family of distributions is used to study the income of farmers in a particular region. The distributions in this family share a parameter $\lambda$, and for income X (in thousand dollars) from a distribution in the family, the PDF is given by

$$f(x) = \begin{cases} \frac{1}{6}\lambda^4 x^3 e^{-\lambda x}, & 0 < x < \infty, \\ 0, & \text{otherwise.} \end{cases}$$

The shape of the PDF for two different values of $\lambda$ is shown here.

[Figure: PDF of X for $\lambda = 4$ and $\lambda = 8$, plotted for x from 0 to 4.]

Furthermore, $E(X) = 4/\lambda$ and $\mathrm{var}(X) = 4/\lambda^2$.

(a) One farmer from the region is randomly selected and interviewed. The farmer reports an income of 1,300 dollars. Based on the likelihood principle, which of the two estimates of $\lambda$, 4 vs. 8, do you prefer? (2 points)

Based on the likelihood principle:

$$L(1.3 \mid \lambda = 4) = \frac{1}{6}\, 4^4 (1.3)^3 e^{-(4)(1.3)} = 0.517,$$

$$L(1.3 \mid \lambda = 8) = \frac{1}{6}\, 8^4 (1.3)^3 e^{-(8)(1.3)} = 0.046.$$

We prefer $\lambda = 4$ because the likelihood is much higher.

(b) More data are subsequently collected and, using the data, the following information from a log-likelihood is known. What is the best estimate of $\lambda$ based on the given information? Justify your choice. (2 points)

λ      l(λ)
6.0    -4378.439
6.1    -4373.518
6.2    -4369.673
6.3    -4366.868
6.4    -4365.071
6.5    -4364.251
6.6    -4364.378
6.7    -4365.423

The MLE, $\hat\lambda$, is the value with the highest log-likelihood, which is 6.5.

(c) Using the result in part (b), estimate the average income and standard deviation of income. (2 points)

$$\widehat{E}(X) = \frac{4}{\hat\lambda} = \frac{4}{6.5} \approx 0.615, \qquad \widehat{\mathrm{var}}(X) = \frac{4}{\hat\lambda^2} = \frac{4}{6.5^2} \approx 0.0947,$$

so the estimated standard deviation is $2/6.5 \approx 0.308$ (in thousand dollars).

(d) The data used to derive (b) and (c) come from a sample of 1,000 farmers, independently observed. Furthermore, the MLE of the population average income is actually $\bar X$, the sample average. Based on this information, form a 95% confidence interval of the population average income and determine whether there is sufficient evidence that the population average income is above 600 dollars. (4 points)

Since we know the MLE for $E(X)$ is $\bar X$, a 95% confidence interval is:

$$\bar X \pm 1.96\sqrt{\mathrm{var}(\bar X)} = \bar X \pm 1.96\sqrt{\frac{\mathrm{var}(X)}{n}} \approx \bar X \pm 1.96\sqrt{\frac{\widehat{\mathrm{var}}(X)}{n}} = \frac{4}{6.5} \pm 1.96\sqrt{\frac{4/6.5^2}{1000}} \approx 0.61538 \pm 0.01907,$$

that is, 596 to 634 dollars. Since the lower confidence limit is below 600 dollars, there is not sufficient evidence that the average income is above 600 dollars.

(e) In (b), the log-likelihood comes from a specific set of data. The same method can be used to obtain an estimate using any set of independent incomes from n farmers, $X_1, \ldots, X_n$. Write down an expression of the log-likelihood function based on $X_1, \ldots, X_n$. (3 points)

$$L(X_1, \ldots, X_n \mid \lambda) \equiv L(\lambda) = \frac{1}{6}\lambda^4 X_1^3 e^{-X_1\lambda} \times \cdots \times \frac{1}{6}\lambda^4 X_n^3 e^{-X_n\lambda} = \prod_{i=1}^{n} \frac{1}{6}\lambda^4 X_i^3 e^{-X_i\lambda},$$

$$\log L(\lambda) \equiv l(\lambda) = \sum_{i=1}^{n} \log\left(\frac{1}{6}\lambda^4 X_i^3 e^{-X_i\lambda}\right) = -n\log 6 + 4n\log\lambda + 3\sum_{i=1}^{n}\log X_i - \lambda\sum_{i=1}^{n} X_i.$$
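Parts (a) to (d) of Question 8 can be reproduced with a few lines of Python. This is a sketch under the exam's own assumptions (the given PDF, the tabulated log-likelihood values and n = 1000); the function and variable names are illustrative.

```python
from math import exp, sqrt

def pdf(x, lam):
    """The exam's density f(x) = (1/6) * lam^4 * x^3 * exp(-lam * x) for x > 0."""
    return lam ** 4 * x ** 3 * exp(-lam * x) / 6

# (a) likelihood of a single observed income of 1.3 (thousand dollars)
lik_4 = pdf(1.3, 4)
lik_8 = pdf(1.3, 8)

# (b) tabulated log-likelihood values; the MLE is the candidate with the largest one
log_lik = {
    6.0: -4378.439, 6.1: -4373.518, 6.2: -4369.673, 6.3: -4366.868,
    6.4: -4365.071, 6.5: -4364.251, 6.6: -4364.378, 6.7: -4365.423,
}
lam_hat = max(log_lik, key=log_lik.get)

# (c) plug-in estimates from E(X) = 4 / lam and var(X) = 4 / lam^2
mean_hat = 4 / lam_hat
sd_hat = 2 / lam_hat

# (d) 95% confidence interval for the mean income with n = 1000
n = 1000
half_width = 1.96 * sd_hat / sqrt(n)
ci = (mean_hat - half_width, mean_hat + half_width)

print(f"L(4)={lik_4:.3f} L(8)={lik_8:.3f} lam_hat={lam_hat}")
print(f"mean={mean_hat:.5f} sd={sd_hat:.5f} CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

The interval is in thousand dollars, so multiplying by 1000 gives the dollar limits quoted in (d).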

(f) The log-likelihood in (e) shows that, for n observations, the estimate of $\lambda$ depends only on the value of $\sum_{i=1}^{n} X_i$ (or equivalently $\bar X$). Does a large $\bar X$ favor a large or small estimate of $\lambda$? Justify briefly. (2 points)

A large $\bar X$ favors a small estimate of $\lambda$. The PDF shows that a large value of $\lambda$ corresponds to a distribution that is unlikely to produce large values of X, and hence unlikely to produce a large $\bar X$ in a sample.
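As a supplementary step that is not part of the original solution, differentiating the log-likelihood from (e) gives the maximizer in closed form, which makes the inverse relationship in (f) explicit:

```latex
\begin{align*}
l(\lambda) &= -n\log 6 + 4n\log\lambda + 3\sum_{i=1}^{n}\log X_i - \lambda\sum_{i=1}^{n} X_i, \\
l'(\lambda) &= \frac{4n}{\lambda} - \sum_{i=1}^{n} X_i = 0
  \quad\Longrightarrow\quad
  \hat\lambda = \frac{4n}{\sum_{i=1}^{n} X_i} = \frac{4}{\bar X}, \\
l''(\lambda) &= -\frac{4n}{\lambda^{2}} < 0 .
\end{align*}
```

The negative second derivative confirms a maximum. Since $E(X) = 4/\lambda$, the estimate $\hat\lambda = 4/\bar X$ is consistent with the statement in (d) that the MLE of the mean is $\bar X$.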

9. (20 points) In a psychology experiment, two groups of children are used to compare the average time (in minutes) to complete two tasks. The data from the experiment are given below:

Task 1               Task 2
$n_1 = 40$           $n_2 = 90$
$\bar X_1 = 4.6$     $\bar X_2 = 3.2$
$s_1^2 = 17$         $s_2^2 = 12.2$

where $\bar X_j$ and $s_j^2 = \sum_{i=1}^{n_j}(x_{ij} - \bar X_j)^2/n_j$ denote, respectively, the sample mean and variance for the j-th task, j = 1, 2. Assume children work on their tasks independently.

(a) Formulate the hypotheses to determine whether there is a difference in the average time in completing the tasks. Carefully define all quantities in your hypotheses. State whether your hypotheses are one- or two-sided and your reason. (3 points)

Let $\mu_1, \mu_2$ be the average times to complete the two tasks. The hypotheses of interest are:

$$H_0: \mu_1 = \mu_2 \quad \text{vs.} \quad H_1: \mu_1 \ne \mu_2,$$

or equivalently

$$H_0: \mu_1 - \mu_2 = 0 \quad \text{vs.} \quad H_1: \mu_1 - \mu_2 \ne 0.$$

The hypotheses are two-sided because we are simply looking for a difference.

(b) Sketch the distribution of the data expected under the null hypothesis. Label the center of that distribution and mark the region(s) that lead(s) you to suspect the null hypothesis may not be true. (3 points)

[Figure: distribution of the value of $\bar X_1 - \bar X_2$ centered at 0, with the regions in both tails labeled "suspect null".]

(c) Under no assumptions other than those given above, test the hypotheses and draw your conclusion. State any known results you use in your test. (3 points)

Since the sample size is big enough, we can use the CLT. Hence, even without further assumptions, under the null hypothesis,

$$\bar X_1 - \bar X_2 \sim N\left(\mu_1 - \mu_2 = 0,\ \sigma_1^2/n_1 + \sigma_2^2/n_2\right),$$

where $\sigma_1^2, \sigma_2^2$ are the variances of times to complete tasks 1 and 2, respectively. Hence the test statistic is:

$$z = \frac{(\bar X_1 - \bar X_2) - 0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} \approx \frac{(\bar X_1 - \bar X_2) - 0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} = \frac{4.6 - 3.2}{\sqrt{17/40 + 12.2/90}} \approx 1.87.$$

Since z = 1.87 < 1.96, which is the critical value for a 5% 2-sided test, there is not sufficient evidence to conclude that the average times are different.

(d) Subsequently, you are told that the variances of the times to complete the tasks can be assumed to be identical. Analyze the given information carefully, repeat the test and draw a conclusion. (4 points)

Since we are told the variances are the same, we can use a pooled variance estimate. Hence we assume

$$\bar X_1 - \bar X_2 \sim N\left(\mu_1 - \mu_2 = 0,\ \sigma^2/n_1 + \sigma^2/n_2\right),$$

where $\sigma^2$ is the common variance of times to complete the tasks. We pool the samples and come up with a pooled variance estimate the following way:

$$s^2 = \frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2} = \frac{(40)(17) + (90)(12.2)}{40 + 90} = 13.67692.$$

The new test statistic now becomes:

$$z = \frac{(\bar X_1 - \bar X_2) - 0}{\sqrt{\sigma^2/n_1 + \sigma^2/n_2}} \approx \frac{(\bar X_1 - \bar X_2) - 0}{\sqrt{s^2/n_1 + s^2/n_2}} = \frac{4.6 - 3.2}{\sqrt{13.67692/40 + 13.67692/90}} \approx 1.99.$$

Since z = 1.99 > 1.96, there is now sufficient evidence to conclude that the average times are different.

(e) Compare the results in (c)-(d) and comment on the differences. (3 points)

The results are different due to a new estimate of the common variance in (d). Notice that since the first sample has a much smaller sample size, its variance estimate $s_1^2$ is not as accurate. When the two variance estimates are very different, the result becomes significant when we can introduce a better estimate of the common variance, i.e., $s_2^2$ based on a larger sample.
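The two test statistics in (c) and (d) can be computed side by side from the summary data. The sketch below also reports two-sided p-values using the standard-library `statistics.NormalDist`; the p-values are an addition for illustration and are not part of the original solution, which compares z to the critical value 1.96.

```python
from math import sqrt
from statistics import NormalDist

# Summary data from the experiment (variances use divisor n_j, as defined in the exam)
n1, mean1, var1 = 40, 4.6, 17.0
n2, mean2, var2 = 90, 3.2, 12.2
diff = mean1 - mean2

# (c) separate variance estimates for the two tasks
z_unpooled = diff / sqrt(var1 / n1 + var2 / n2)

# (d) pooled estimate of a common variance, weighted by sample size
pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
z_pooled = diff / sqrt(pooled_var * (1 / n1 + 1 / n2))

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

p_unpooled = two_sided_p(z_unpooled)
p_pooled = two_sided_p(z_pooled)

print(f"unpooled: z={z_unpooled:.2f}, p={p_unpooled:.3f}")
print(f"pooled:   z={z_pooled:.2f}, p={p_pooled:.3f}")
```

The two statistics fall on opposite sides of 1.96, so the conclusion at the 5% level depends on which variance assumption is adopted.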

(f) Assume the times to complete the tasks follow exponential distributions. Repeat the test, draw a conclusion and write a short comparison of your answer here to those in (c) and (d). (4 points)

Based on this new information, notice that the estimates of the means are still the same, since the MLE of the mean based on exponential data is the sample mean. Under the null hypothesis, $\mu_1 = \mu_2 = E(X) = 1/\lambda$, where $\lambda$ is the common parameter in an Exp($\lambda$) distribution. Since the variance of an Exp($\lambda$) distribution is $\mathrm{var}(X) = 1/\lambda^2 = E(X)^2$, we want to determine the MLE of $\lambda$ under the null hypothesis. Under the null hypothesis, all the data follow the same exponential distribution, hence the MLE of $E(X) = 1/\lambda$ is

$$\widehat{E}(X) = \frac{n_1\bar X_1 + n_2\bar X_2}{n_1 + n_2} = \frac{(40)(4.6) + (90)(3.2)}{40 + 90} = 3.630769.$$

Hence by the invariance property, the MLE for the variance is

$$\widehat{\mathrm{var}}(X) = \widehat{E}(X)^2 = 3.630769^2 = 13.18248 = s^2_{\exp}.$$

Now the test statistic becomes

$$z = \frac{(\bar X_1 - \bar X_2) - 0}{\sqrt{s^2_{\exp}/n_1 + s^2_{\exp}/n_2}} = \frac{4.6 - 3.2}{\sqrt{13.18248/40 + 13.18248/90}} \approx 2.03.$$

Since z = 2.03 > 1.96, there is also sufficient evidence to conclude that the average times are different. Notice that this is the best test among the three because this test uses the most information.
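The exponential version of the test can be sketched in the same way. The calculation below takes $\bar X_1 = 4.6$ and $\bar X_2 = 3.2$ as listed in the data table for Question 9; variable names are illustrative.

```python
from math import sqrt

# Sample sizes and sample means from the data table in Question 9
n1, mean1 = 40, 4.6
n2, mean2 = 90, 3.2

# Under the null hypothesis both samples share one exponential distribution,
# so the MLE of its mean is the overall sample mean
pooled_mean = (n1 * mean1 + n2 * mean2) / (n1 + n2)

# For an exponential distribution the variance equals the squared mean,
# so by invariance the MLE of the variance is the squared pooled mean
var_exp = pooled_mean ** 2

z_exp = (mean1 - mean2) / sqrt(var_exp * (1 / n1 + 1 / n2))

print(f"pooled mean={pooled_mean:.6f}, variance={var_exp:.5f}, z={z_exp:.2f}")
```

Here the variance estimate comes entirely from the sample means, so the sample variances $s_1^2$ and $s_2^2$ are not used at all.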

10. (18 points) An industrialist, Momofuku, wants to build a new factory. He has done some research and there are two possible locations for the factory. However, he has not decided on how to choose between them. His research contains some data on his existing 20 factories. The data are observations of productivity [PRODUCTIVITY, in log(value added/employees)], the number of businesses per 1,000 population in the surrounding 50 km² [BUSINESS DENSITY, in log(number/1,000)] and the proportion of the population with at least a high school diploma in the surrounding 50 km² [LABOR SUPPLY, in log(proportion)]. Some summary statistics of the data are given in the following table:

Variable                $\sum X_i$    $\sum X_i^2$    $\sum X_iY_i$    $\sum Y_i$    $\sum Y_i^2$
Y  PRODUCTIVITY                                                        52.8          168.8
X  BUSINESS DENSITY     -33.3         90.3            -73.2
X  LABOR SUPPLY         -32.9         68.6            -76.3

(all sums run over $i = 1, \ldots, n$)

(a) Suppose the assumptions of a simple linear regression are satisfied between each of the two X variables and Y (PRODUCTIVITY). Fit a simple linear regression for predicting PRODUCTIVITY using LABOR SUPPLY and write your model to 2 decimal places. THERE IS NO NEED TO CARRY OUT HYPOTHESIS TESTS HERE. (3 points)

$n = 20$, $\bar X = -32.9/20 = -1.645$, $\bar Y = 52.8/20 = 2.64$,

$$\hat b = \frac{\sum_{i=1}^{n} X_iY_i - n\bar X\bar Y}{\sum_{i=1}^{n} X_i^2 - n\bar X^2} = \frac{-76.3 - 20(-1.645)(2.64)}{68.6 - 20(-1.645)^2} = 0.7290307,$$

$$\hat a = \bar Y - \hat b\bar X = 2.64 - 0.7290307(-1.645) = 3.839256.$$

Therefore, to 2 decimal places, $\hat Y = 3.84 + 0.73X$.

(b) The following table gives the simple linear regression results between Y and the two X variables, where columns $\hat a$ and $\hat b$ give the estimates of a and b in a linear regression $Y = a + bX + e$; so your answers in part (a) are A and B in the table. The numbers in brackets are SD($\hat a$) and SD($\hat b$). Fill in the entries under the column Test statistic and use the accompanying t-table to determine which X variable(s) could be used to predict Y. Demonstrate carefully the process using LABOR SUPPLY as an example. For BUSINESS DENSITY, just write your answer in the table. Mark with an asterisk the X variable(s) you believe could be used to predict Y. (4 points)

Variable            $\hat a$         $\hat b$         Test statistic
BUSINESS DENSITY    3.34 (0.40)      0.42 (0.19)
LABOR SUPPLY        A (0.53)         B (0.29)

t-table

df                18       19       20       21       22       23       30       >120
critical value    2.101    2.093    2.086    2.080    2.074    2.069    2.042    1.96

Using the results from (a) and the table:

$$z = \frac{\hat b}{SD(\hat b)} = \frac{0.73}{0.29} = 2.52.$$

We can compare z to a critical value of 2.101, based on df = n - 2 = 18, from the t-table. Since z > 2.101, we conclude there is some evidence that PRODUCTIVITY is related to LABOR SUPPLY.

Variable              $\hat a$         $\hat b$         Test statistic
BUSINESS DENSITY *    3.34 (0.40)      0.42 (0.19)      2.21
LABOR SUPPLY *        3.84 (0.53)      0.73 (0.29)      2.52

(* test statistic exceeds the critical value 2.101)

Questions (c)-(e) are based on the results of (b) and the following figures and table. The figures show the scatter plots of the data, superimposed with 95% confidence and prediction intervals, based on linear regressions between the two X variables and PRODUCTIVITY. The dotted grey line is drawn for ease of referencing. The table gives additional data for the two locations.

[Figure: scatter plots of PRODUCTIVITY against BUSINESS DENSITY (left panel) and against LABOR SUPPLY (right panel), each with 95% confidence and prediction intervals.]

Location    BUSINESS DENSITY    LABOR SUPPLY
A           1.1                 -2.5
B           -1.1                -1.5

(c) Momofuku wants to be at least 95% sure that PRODUCTIVITY is above zero (in log-value). Which location should he choose? State the reason and illustrate carefully how you obtain your answer by marking at appropriate locations on the figures. (4 points)

We use the prediction intervals, which are given by the dashed (red) curves, since we are interested only in one location with a particular predictor. We have two possible predictors and, from (b), both are possible models we can use. Hence, we will compare locations A and B, in terms of productivity, using each predictor in turn.

We first look at BUSINESS DENSITY. For location A, BUSINESS DENSITY = 1.1, hence we look at the 95% prediction interval at that value. Since the lower limit is above 0, we can be 95% sure that it has PRODUCTIVITY above 0. For location B, BUSINESS DENSITY = -1.1, but even so, the lower limit is still above zero, hence we can also be 95% certain that PRODUCTIVITY is above 0 at B. The answers are illustrated on the figure on the left panel.

Turning to LABOR SUPPLY, for location A, the lower limit is below 0, indicating we cannot be 95% certain that PRODUCTIVITY will be above 0. For location B, the lower limit remains above 0. Hence based on this analysis, location B is preferred.

Summarizing the results, we prefer location B because using either predictor gives us 95% certainty that PRODUCTIVITY will be above zero.

[Figure: the scatter plots of PRODUCTIVITY against BUSINESS DENSITY (left panel) and LABOR SUPPLY (right panel), marked at locations A and B.]

(d) In a regression analysis, we often use the percent variation explained to describe the usefulness of our model. Find the percent variation explained in the model using LABOR SUPPLY as the predictor. Furthermore, use the accompanying figure (which may not be based on your model) to illustrate what has NOT been explained by the model you created. (4 points)

[Figure: scatter plot of PRODUCTIVITY against PREDICTOR with the fitted line.]

$$r = \frac{\sum_{i=1}^{n} X_iY_i - n\bar X\bar Y}{\sqrt{\left(\sum_{i=1}^{n} X_i^2 - n\bar X^2\right)\left(\sum_{i=1}^{n} Y_i^2 - n\bar Y^2\right)}} = \frac{-76.3 - 20(-1.645)(2.64)}{\sqrt{\left(68.6 - 20(-1.645)^2\right)\left(168.8 - 20(2.64)^2\right)}} = 0.511.$$

$R^2$ can be calculated by squaring r: $R^2 = r^2 = 0.511^2 \approx 0.26$. The value of $R^2$ in this regression is rather small, indicating the regression model fit is only modest. Furthermore, the percent variation explained is $0.26 \times 100\% = 26\%$.

We can use the accompanying figure to illustrate what we mean when we say some of the variation is not explained by the model. Even after fitting the model (the red line), the predicted PRODUCTIVITY can still deviate from the observations, which means the model is not perfect. These deviations are the unexplained variation. In this model, the unexplained variation is quite high, so the remaining errors are quite substantial.
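The regression quantities in (a), (b) and (d) all follow from the summary statistics, so they can be recomputed together. In this sketch the helper `fit` and the `predictors` dictionary are illustrative names; the slope standard deviations 0.19 and 0.29 are taken from the table in (b), and the test statistic divides the slope rounded to 2 decimal places, as the worked solution does.

```python
from math import sqrt

n = 20
sum_y, sum_y2 = 52.8, 168.8   # PRODUCTIVITY totals

# name -> (sum of X, sum of X^2, sum of X*Y, SD of the slope estimate from part (b))
predictors = {
    "BUSINESS DENSITY": (-33.3, 90.3, -73.2, 0.19),
    "LABOR SUPPLY": (-32.9, 68.6, -76.3, 0.29),
}

def fit(sum_x, sum_x2, sum_xy):
    """Least-squares intercept, slope and correlation from summary statistics."""
    x_bar, y_bar = sum_x / n, sum_y / n
    sxy = sum_xy - n * x_bar * y_bar
    sxx = sum_x2 - n * x_bar ** 2
    syy = sum_y2 - n * y_bar ** 2
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / sqrt(sxx * syy)
    return intercept, slope, r

results = {}
for name, (sum_x, sum_x2, sum_xy, sd_slope) in predictors.items():
    intercept, slope, r = fit(sum_x, sum_x2, sum_xy)
    t_stat = round(slope, 2) / sd_slope   # rounded slope, following the solution
    results[name] = (intercept, slope, t_stat, r ** 2)
    print(f"{name}: a={intercept:.2f} b={slope:.2f} t={t_stat:.2f} R^2={r ** 2:.2f}")
```

Each test statistic can then be compared with the critical value 2.101 for df = 18 from the t-table.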

[Figure: scatter plot of PRODUCTIVITY against PREDICTOR with the fitted line.]

(e) Suppose we want to estimate PRODUCTIVITY for two different types of location that have BUSINESS DENSITY = -4 and -1, respectively. Which type of location (-4 or -1) would you expect to have a larger margin of error, assuming the current set of data is used? Explain briefly with the aid of the figures in (c). (3 points)

A confidence interval's width tells us the margin of error in our estimate. In the context of a regression, among other things, the distance of the predictor from its average plays a role. Since -4 is very far away from the average of -33.3/20 = -1.665 in the data, we expect the margin of error for estimating PRODUCTIVITY at -4 to be larger. This point is illustrated by the set of solid (blue) curves, which fan out for values of BUSINESS DENSITY away from -1.665, in both directions.

Areas under the standard normal curve beyond z*, i.e., the shaded area

[Figure: standard normal curve with the area beyond z* shaded.]

z      0.00     0.01     0.02     0.03     0.04     0.05     0.06     0.07     0.08     0.09
0.0    0.5000   0.4960   0.4920   0.4880   0.4840   0.4801   0.4761   0.4721   0.4681   0.4641
0.1    0.4602   0.4562   0.4522   0.4483   0.4443   0.4404   0.4364   0.4325   0.4286   0.4247
0.2    0.4207   0.4168   0.4129   0.4090   0.4052   0.4013   0.3974   0.3936   0.3897   0.3859
0.3    0.3821   0.3783   0.3745   0.3707   0.3669   0.3632   0.3594   0.3557   0.3520   0.3483
0.4    0.3446   0.3409   0.3372   0.3336   0.3300   0.3264   0.3228   0.3192   0.3156   0.3121
0.5    0.3085   0.3050   0.3015   0.2981   0.2946   0.2912   0.2877   0.2843   0.2810   0.2776
0.6    0.2743   0.2709   0.2676   0.2643   0.2611   0.2578   0.2546   0.2514   0.2483   0.2451
0.7    0.2420   0.2389   0.2358   0.2327   0.2296   0.2266   0.2236   0.2206   0.2177   0.2148
0.8    0.2119   0.2090   0.2061   0.2033   0.2005   0.1977   0.1949   0.1922   0.1894   0.1867
0.9    0.1841   0.1814   0.1788   0.1762   0.1736   0.1711   0.1685   0.1660   0.1635   0.1611
1.0    0.1587   0.1562   0.1539   0.1515   0.1492   0.1469   0.1446   0.1423   0.1401   0.1379
1.1    0.1357   0.1335   0.1314   0.1292   0.1271   0.1251   0.1230   0.1210   0.1190   0.1170
1.2    0.1151   0.1131   0.1112   0.1093   0.1075   0.1056   0.1038   0.1020   0.1003   0.0985
1.3    0.0968   0.0951   0.0934   0.0918   0.0901   0.0885   0.0869   0.0853   0.0838   0.0823
1.4    0.0808   0.0793   0.0778   0.0764   0.0749   0.0735   0.0721   0.0708   0.0694   0.0681
1.5    0.0668   0.0655   0.0643   0.0630   0.0618   0.0606   0.0594   0.0582   0.0571   0.0559
1.6    0.0548   0.0537   0.0526   0.0516   0.0505   0.0495   0.0485   0.0475   0.0465   0.0455
1.7    0.0446   0.0436   0.0427   0.0418   0.0409   0.0401   0.0392   0.0384   0.0375   0.0367
1.8    0.0359   0.0351   0.0344   0.0336   0.0329   0.0322   0.0314   0.0307   0.0301   0.0294
1.9    0.0287   0.0281   0.0274   0.0268   0.0262   0.0256   0.0250   0.0244   0.0239   0.0233
2.0    0.0228   0.0222   0.0217   0.0212   0.0207   0.0202   0.0197   0.0192   0.0188   0.0183
2.1    0.0179   0.0174   0.0170   0.0166   0.0162   0.0158   0.0154   0.0150   0.0146   0.0143
2.2    0.0139   0.0136   0.0132   0.0129   0.0125   0.0122   0.0119   0.0116   0.0113   0.0110
2.3    0.0107   0.0104   0.0102   0.0099   0.0096   0.0094   0.0091   0.0089   0.0087   0.0084
2.4    0.0082   0.0080   0.0078   0.0075   0.0073   0.0071   0.0069   0.0068   0.0066   0.0064
2.5    0.0062   0.0060   0.0059   0.0057   0.0055   0.0054   0.0052   0.0051   0.0049   0.0048
2.6    0.0047   0.0045   0.0044   0.0043   0.0041   0.0040   0.0039   0.0038   0.0037   0.0036
2.7    0.0035   0.0034   0.0033   0.0032   0.0031   0.0030   0.0029   0.0028   0.0027   0.0026
2.8    0.0026   0.0025   0.0024   0.0023   0.0023   0.0022   0.0021   0.0021   0.0020   0.0019
2.9    0.0019   0.0018   0.0018   0.0017   0.0016   0.0016   0.0015   0.0015   0.0014   0.0014
3.0    0.0013   0.0013   0.0013   0.0012   0.0012   0.0011   0.0011   0.0011   0.0010   0.0010
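Entries of the table above are upper-tail areas of the standard normal distribution, so any cell can be regenerated from the normal CDF. The sketch below uses the standard-library `statistics.NormalDist`; the function names `upper_tail` and `table_row` are illustrative.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def upper_tail(z):
    """Area under the standard normal curve beyond z."""
    return 1 - STANDARD_NORMAL.cdf(z)

def table_row(first_decimal):
    """One table row: areas for z = first_decimal + 0.00, 0.01, ..., 0.09."""
    return [round(upper_tail(first_decimal + j / 100), 4) for j in range(10)]

# Reproduce the row for z = 1.9, which contains the familiar 1.96 cutoff
print(table_row(1.9))
```

The row is indexed by the first decimal of z and the column by the second decimal, matching the layout of the printed table.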