Swarthmore Honors Exam 2012: Statistics

Similar documents
Swarthmore Honors Exam 2015: Statistics

Stat 5102 Final Exam May 14, 2015

Final Exam Bus 320 Spring 2000 Russell

SCHOOL OF MATHEMATICS AND STATISTICS

UNIVERSITY OF TORONTO Faculty of Arts and Science

MS&E 226: Small Data

Inferences for Regression

Class 26: review for final exam 18.05, Spring 2014

STAT 285 Fall Assignment 1 Solutions

MATH 644: Regression Analysis Methods

ST430 Exam 2 Solutions

Correlation & Simple Regression

Post-exam 2 practice questions 18.05, Spring 2014

18.05 Final Exam. Good luck! Name. No calculators. Number of problems 16 concept questions, 16 problems, 21 pages

Confidence Intervals, Testing and ANOVA Summary

18.05 Practice Final Exam

MS&E 226: Small Data

STAT 512 MidTerm I (2/21/2013) Spring 2013 INSTRUCTIONS

Harvard University. Rigorous Research in Engineering Education

Regression. Marc H. Mehlman University of New Haven

R 2 and F -Tests and ANOVA

ST430 Exam 1 with Answers

Mathematical Notation Math Introduction to Applied Statistics

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference.

Quiz 1. Name: Instructions: Closed book, notes, and no electronic devices.

INFERENCE FOR REGRESSION

This does not cover everything on the final. Look at the posted practice problems for other topics.

Inference for Regression

Practice Final Examination

EXAM # 2. Total 100. Please show all work! Problem Points Grade. STAT 301, Spring 2013 Name

SMAM 314 Exam 42 Name

SCHOOL OF MATHEMATICS AND STATISTICS

UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences Midterm Test, October 2013

Exam Applied Statistical Regression. Good Luck!

Biostatistics 380 Multiple Regression 1. Multiple Regression

Lecture 10: F -Tests, ANOVA and R 2

Open book, but no loose leaf notes and no electronic devices. Points (out of 200) are in parentheses. Put all answers on the paper provided to you.

Exam Empirical Methods VU University Amsterdam, Faculty of Exact Sciences h, February 12, 2015

Confidence Intervals with σ unknown

Frequentist Statistics and Hypothesis Testing Spring

Exam 2 Practice Questions, 18.05, Spring 2014

STAT 350 Final (new Material) Review Problems Key Spring 2016

Matrices and vectors A matrix is a rectangular array of numbers. Here s an example: A =

Lecture 18: Simple Linear Regression

CENTRAL LIMIT THEOREM (CLT)

28. SIMPLE LINEAR REGRESSION III

CAS MA575 Linear Models

The scatterplot is the basic tool for graphically displaying bivariate quantitative data.

ISQS 5349 Final Exam, Spring 2017.

Chapter 12 - Part I: Correlation Analysis

Multiple Regression Introduction to Statistics Using R (Psychology 9041B)

Mathematical Notation Math Introduction to Applied Statistics

Bayesian Inference. STA 121: Regression Analysis Artin Armagan

ORF 245 Fundamentals of Engineering Statistics. Final Exam

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Institute of Actuaries of India

ORF 245 Fundamentals of Engineering Statistics. Final Exam

STAT 510 Final Exam Spring 2015

Introduction to Linear Regression Rebecca C. Steorts September 15, 2015

AMS-207: Bayesian Statistics

This exam contains 13 pages (including this cover page) and 10 questions. A Formulae sheet is provided with the exam.

Stat 411/511 ESTIMATING THE SLOPE AND INTERCEPT. Charlotte Wickham. stat511.cwick.co.nz. Nov

Regression and the 2-Sample t

BIOL Biometry LAB 6 - SINGLE FACTOR ANOVA and MULTIPLE COMPARISON PROCEDURES

(a) (3 points) Construct a 95% confidence interval for β 2 in Equation 1.

Name: Biostatistics 1 st year Comprehensive Examination: Applied in-class exam. June 8 th, 2016: 9am to 1pm

Conjugate Priors: Beta and Normal Spring 2018

Conjugate Priors: Beta and Normal Spring 2018

STAT 350. Assignment 4

16.400/453J Human Factors Engineering. Design of Experiments II

Problem #1 #2 #3 #4 #5 #6 Total Points /6 /8 /14 /10 /8 /10 /56

Chapter Goals. To understand the methods for displaying and describing relationship among variables. Formulate Theories.

9 Correlation and Regression

Algebra 2 Practice Midterm

Midterm Examination. STA 215: Statistical Inference. Due Wednesday, 2006 Mar 8, 1:15 pm

" M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2

Lectures on Simple Linear Regression Stat 431, Summer 2012

Point Estimation. Vibhav Gogate The University of Texas at Dallas

DS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling

Bayesian Analysis LEARNING OBJECTIVES. Calculating Revised Probabilities. Calculating Revised Probabilities. Calculating Revised Probabilities

General Linear Model (Chapter 4)

DSST Principles of Statistics

Objectives Simple linear regression. Statistical model for linear regression. Estimating the regression parameters

Salt Lake Community College MATH 1040 Final Exam Fall Semester 2011 Form E

STATS216v Introduction to Statistical Learning Stanford University, Summer Midterm Exam (Solutions) Duration: 1 hours

Regression, Part I. - In correlation, it would be irrelevant if we changed the axes on our graph.

Two hours. To be supplied by the Examinations Office: Mathematical Formula Tables THE UNIVERSITY OF MANCHESTER. 21 June :45 11:45

First Year Examination Department of Statistics, University of Florida

The point value of each problem is in the left-hand margin. You must show your work to receive any credit, except in problem 1. Work neatly.

Simple linear regression

Fundamental Probability and Statistics

Confidence Intervals for Normal Data Spring 2018

Quantitative Introduction ro Risk and Uncertainty in Business Module 5: Hypothesis Testing

PAKURANGA COLLEGE. 12MAT Mathematics and Statistics Practice Exams, Apply calculus methods in solving problems (5 credits)

Chemical Engineering: 4C3/6C3 Statistics for Engineering McMaster University: Final examination

Chapter 1: Linear Regression with One Predictor Variable also known as: Simple Linear Regression Bivariate Linear Regression

Stat 135 Fall 2013 FINAL EXAM December 18, 2013

Written Exam (2 hours)

Business Statistics. Lecture 10: Course Review

STAT 215 Confidence and Prediction Intervals in Regression

Transcription:

Swarthmore Honors Exam 2012: Statistics 1 Swarthmore Honors Exam 2012: Statistics John W. Emerson, Yale University NAME: Instructions: This is a closed-book three-hour exam having six questions. You may not refer to notes or textbooks. You may use a calculator that does not do algebra or calculus. Normal, t, and F tables should be supplied with this exam. Please note: ˆ A few of the questions have a considerable amount of description and background which is intended to help you. Read this material carefully and completely before working on the problem. ˆ Most questions have multiple parts (only question 4 is a single question). Number the questions and parts clearly in your work, and start each of the questions on a new page. ˆ Questions that explicitly ask for (or imply the need for) discussion are a chance to demonstrate your understanding of the material. As a guideline, a short paragraph is likely appropriate and preferable to either a single sentence or an longer essay.

Swarthmore Honors Exam 2012: Statistics 2 1. A basketball coach wants to devise an innovative way to test whether her players have improved by practicing over the summer. The previous season, her star, Julie, made 70% of her free throws. Julie claims to have improved over the summer; she now thinks she is a 80% free throw shooter. The coach doubts Julie has improved; she asks Julie to start shooting free throws. Let X be a random variable corresponding to the number of consecutive free throws Julie makes before her first miss. If we assume Julie s free throws are independent and that she makes each shot with some fixed, unknown probability p, the geometric distribution would seem appropriate: P (X = x) = p x (1 p), for x = 0, 1, 2,... The following table presents these probabilities for two values of p, 0.8 and 0.7. The last row gives the probabilities of making 20 or more shots before the first miss. x P (X = x p = 0.8) P (X = x p = 0.7) 0 0.2000 0.3000 1 0.1600 0.2100 2 0.1280 0.1470 3 0.1024 0.1029 4 0.0819 0.0720 5 0.0655 0.0504 6 0.0524 0.0353 7 0.0419 0.0247 8 0.0336 0.0173 9 0.0268 0.0121 10 0.0215 0.0085 11 0.0172 0.0059 12 0.0137 0.0042 13 0.0110 0.0029 14 0.0088 0.0020 15 0.0070 0.0014 16 0.0056 0.0010 17 0.0045 0.0007 18 0.0036 0.0005 19 0.0029 0.0003 20 0.0115 0.0009 a. Consider testing H 0 : p = 0.7 versus H a : p > 0.7. Derive the rejection region for the test corresponding to significance level α = 0.05. b. What is the probability of a Type I error? Explain this concept in the context of this problem to Julie s coach, who has never studied statistics. c. What is the power of this test against Julie s proposal that she improved and is actually a 80% free throw shooter? Again, explain this concept in the context of this problem to Julie s coach, who has never studied statistics.

Swarthmore Honors Exam 2012: Statistics 3 2. Suppose {X i } is a set of n 1 independent identically distributed (iid) random variables from the uniform distribution on the interval (0, θ) for 0 < θ <. a. Find the maximum likelihood estimator (MLE) for θ. b. Prove that the MLE is a biased estimator for θ. c. Prove that the MLE is a consistent estimator for θ. d. Propose an unbiased estimator for θ. Is this estimator preferable to the MLE? Include a discussion of the basis for your preference, as if you were presenting this solution to a fellow student of mathematical statistics. 3. Suppose X 1, X 2,... are iid Bernoulli random variables such that P (X i = 1) = p, with p some fixed value in (0, 1). Define L 1 and L 2 to be the lengths of the first and second runs, respectively, in the sequence generated by the X i s. A run is a collection of consecutive common outcomes. So, for example, the sequence 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0,... would produce L 1 = 3, L 2 = 2, L 3 = 1, and L 4 = 5. a. What is the distribution of L 1? b. What is the distribution of L 2? c. Are L 1 and L 2 independent? Discuss. 4. A coin will be tossed once, and we are interested in estimating the probability of heads, p. A Bayesian will propose using a uniform prior distribution on p, g(p) = 1, 0 p 1, and will estimate p using the mean of the posterior distribution. A frequentist will use the maximum likelihood estimator (MLE) for p. Show how to derive both the MLE and the Bayes estimator. Using squared error loss, describe (exactly or approximately, with justification) the range of values of p for which the MLE is preferable to the Bayes estimator.

Swarthmore Honors Exam 2012: Statistics 4 5. This problem examines data showing the effect of two soporific (sleep-inducing) drugs. The variable extra shows the increase in hours of sleep for each of n = 20 patients who had been randomly assigned to two groups for the study. A few observations are shown here: extra group 1 3.4 0 2 0.8 0 3 1.9 1 4 4.4 1 5-1.2 0 There are n 0 = n 1 = 10 subjects in each group. In terms of underlying probability models, you may assume EXT RA i N(µ 0, σ 2 ) for subject i in group 0, and EXT RA j N(µ 1, σ 2 ) for subject j in group 1. The sample variance of all 20 measurements is where the overall mean is s 2 total = 1 n 1 n (extra i extra) 2 = 4.0720, i=1 extra = 1 n extra i = 1.5400. n i=1 The sample variance of measurements in group 0 is s 2 0 =3.2006, and the variance of measurements in group 1 is s 2 1 =4.009. A pooled 2-sample t-test is performed, making use of the pooled estimate of the variance, s 2 pooled = (n 0 1) s 2 0 + (n 1 1) s 2 1 n 0 + n 1 2 = 3.6048, and giving the following result: Two Sample t-test data: extra by group t = -1.8608, df = 18, p-value = 0.07919 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -3.363874 0.203874 sample estimates: mean in group 0 mean in group 1 0.75 2.33

Swarthmore Honors Exam 2012: Statistics 5 a. Show how to obtain the confidence interval ( 3.363874, 0.203874) from the information provided, above. Clearly define any quantities used. b. Critique the following statement: The 95% confidence interval ( 3.363874, 0.203874) contains the true difference in drug effectiveness with probability 0.95. c. Consider the linear model EXT RA i = α + β group i + ε i for i = 1, 2,..., 20, where group i takes values 0 and 1 as described above for this data set and standard regression assumptions include that ε i are iid N(0, σ 2 ). We observe extra i in our data set (and have various helpful summary statistics on the previous page of this exam). We fit the model and obtain estimates α and β, provided below. Complete the following regression summary and analysis of variance (ANOVA) table that would be associated with fitting this model using these data. A word of caution: This final part of this problem should not require memorization of general regression or analysis of variance formulae. Almost everything involves means and variances (or associated sums of squares) closely related to quantities shown above. End of word of caution. Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) 0.75 group 1.58 Residual standard error: on degrees of freedom Multiple R-squared: Analysis of Variance Table Response: extra Df Sum Sq Mean Sq F value Pr(>F) group 0.07919 Residuals

Swarthmore Honors Exam 2012: Statistics 6 6. This problem examines basketball games from the 2007-08 NBA season. Prior to each game, bookies publish point spreads which are predicted point differentials. For instance, if Team A is thought to be stronger than Team B by a margin of 10 points, the point spread (from the perspective of Team A) would be -10. After each game, of course, we can observe the actual point differential (the score for Team B minus the score for Team A in this example, from the perspective of Team A) and compare this value to the bookies prediction. Thus, an actual point differential of -9 (Team A in fact winning by 9 points) would be very close to the predicted 10 point advantage (the published point spread of -10). A few raw data entries from the perspective of the Boston Celtics (thought to be a stronger team) follow, where -3.5 in the 11/27 game for Boston versus Cleveland (V shows that Boston was visiting Cleveland) indicates that Boston was favored by 3.5 points (and then lost the game, 104-109): Boston versus (3 sample game results): 11/27 Cleveland -3.5 104-109 V 11/29 New York -12.5 104-59 H 11/30 Miami -3.0 95-85 V A few raw data entries from the perspective of the New York Knicks (thought to be a weaker team) follow, where +5.5 in the 11/26 game for New York versus Utah (H shows that it was a home game for New York) indicates that Utah was favored by 5.5 points (but New York won, 113-109): New York versus (3 sample game results): 11/26 Utah +5.5 113-109 H 11/29 Boston +12.5 59-104 V 11/30 Milwaukee +2.0 91-88 H There are 2496 such entries in this data set, yielding the plot on the next page and the regression summary given as Problem 6, regression summary two pages later. Two plots of residuals are also provided for your consideration. A reporter proposes a two-sided hypothesis test of an efficient market theory that the unknown true slope, β, of the relationship between bookies point spreads and the observed game point differential is equal to one. That is, if the gambling market were efficient, the bookies point spreads prior to the game would be unbiased predictors of the game point differentials. a. Find the p-value for the reporter s desired test based on the regression output, Problem 6, regression summary. This is not available directly from the regression output; show your work. Explain to the reporter, who has never studied statistics, the meaning of this p-value. b. In fact, a very sensible solution to part a, receiving full credit above, could still be misleading. What do you think? If there is still a problem, explain whether you think it really matters in terms of the desired test. If you think it is appropriate and feasible (given that you don t have access to a computer), offer an approximate correction of this p-value and discuss any assumptions you make in doing so. If you don t feel it is appropriate or feasible, explain why. In either case, defend your reasoning.

Swarthmore Honors Exam 2012: Statistics 7 20 10 0 10 20 40 20 0 20 40 Bookie Spread (prior to start of game) Observed Score Differential

Swarthmore Honors Exam 2012: Statistics 8 20 10 0 10 20 40 20 0 20 40 Fitted Values versus Residuals Regression Fitted Values Residuals 3 2 1 0 1 2 3 40 20 0 20 40 Normal Quantile Plot of Residuals Theoretical Quantiles Sample Quantiles

Swarthmore Honors Exam 2012: Statistics 9 Problem 6, regression summary. Residual summary: Min 1Q Median 3Q Max -40.177-7.836 0.005 7.845 40.186 Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) -0.004692 0.239000-0.02 0.984 pointspread 1.060423 0.029207 36.31 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 11.94 on 2494 degrees of freedom Multiple R-squared: 0.3165, Adjusted R-squared: 0.3162 F-statistic: 1155 on 1 and 2494 DF, p-value: < 2.2e-16 An aside, in case you were curious: You can see evidence of a crossing pattern in the scatterplot. Basketball games can t end in a tie, explaining the lack of 0 values for the observed game score differentials. Less obvious is the fact that bookies don t publish point spreads of -0.5 or 0.5 (and point spreads equal to 0 are more unusual than you might expect). This creates the vertical gap in the plot.