Unit 1 Review of BIOSTATS 540 Practice Problems SOLUTIONS - Stata Users

Similar documents
What is statistics? Statistics is the science of: Collecting information. Organizing and summarizing the information collected

= X+(z.95)SE(X) X-(z.95)SE(X)

Unit 9 Regression and Correlation Homework #14 (Unit 9 Regression and Correlation) SOLUTIONS. X = cigarette consumption (per capita in 1930)

Practice problems from chapters 2 and 3

UNIVERSITY OF MASSACHUSETTS Department of Biostatistics and Epidemiology BioEpi 540W - Introduction to Biostatistics Fall 2004

Sampling Distributions: Central Limit Theorem

Unit 9 ONE Sample Inference SOLUTIONS

TA: Sheng Zhgang (Th 1:20) / 342 (W 1:20) / 343 (W 2:25) / 344 (W 12:05) Haoyang Fan (W 1:20) / 346 (Th 12:05) FINAL EXAM

Tests for Population Proportion(s)

Contingency Tables. Safety equipment in use Fatal Non-fatal Total. None 1, , ,128 Seat belt , ,878

BIO5312 Biostatistics Lecture 6: Statistical hypothesis testings

Inference for Distributions Inference for the Mean of a Population

Chapter 7. Practice Exam Questions and Solutions for Final Exam, Spring 2009 Statistics 301, Professor Wardrop

Lecture Slides. Elementary Statistics Eleventh Edition. by Mario F. Triola. and the Triola Statistics Series 9.1-1

Tables Table A Table B Table C Table D Table E 675

Chapter 2 Solutions Page 15 of 28

(x t. x t +1. TIME SERIES (Chapter 8 of Wilks)

(a) The density histogram above right represents a particular sample of n = 40 practice shots. Answer each of the following. Show all work.

AMS7: WEEK 7. CLASS 1. More on Hypothesis Testing Monday May 11th, 2015

Chapter 9. Inferences from Two Samples. Objective. Notation. Section 9.2. Definition. Notation. q = 1 p. Inferences About Two Proportions

STAT 350 Final (new Material) Review Problems Key Spring 2016

QUANTITATIVE DATA. UNIVARIATE DATA data for one variable

Survey on Population Mean

BINF 702 SPRING Chapter 8 Hypothesis Testing: Two-Sample Inference. BINF702 SPRING 2014 Chapter 8 Hypothesis Testing: Two- Sample Inference 1

Relax and good luck! STP 231 Example EXAM #2. Instructor: Ela Jackiewicz

Biostatistics Quantitative Data

Statistics 100 Exam 2 March 8, 2017

Topic 2 Part 3 [189 marks]

University of California, Berkeley, Statistics 131A: Statistical Inference for the Social and Life Sciences. Michael Lugo, Spring 2012

Hypothesis testing: Steps

A is one of the categories into which qualitative data can be classified.

11 CHI-SQUARED Introduction. Objectives. How random are your numbers? After studying this chapter you should

Exercises from Chapter 3, Section 1

GROUPED DATA E.G. FOR SAMPLE OF RAW DATA (E.G. 4, 12, 7, 5, MEAN G x / n STANDARD DEVIATION MEDIAN AND QUARTILES STANDARD DEVIATION

CIVL /8904 T R A F F I C F L O W T H E O R Y L E C T U R E - 8

AP Final Review II Exploring Data (20% 30%)

One-Way ANOVA. Some examples of when ANOVA would be appropriate include:

EXAM # 2. Total 100. Please show all work! Problem Points Grade. STAT 301, Spring 2013 Name

Chapter 5 Formulas Distribution Formula Characteristics n. π is the probability Function. x trial and n is the. where x = 0, 1, 2, number of trials

Analysis of Variance. Contents. 1 Analysis of Variance. 1.1 Review. Anthony Tanbakuchi Department of Mathematics Pima Community College

Chapter 2: Tools for Exploring Univariate Data

BIOSTATS 640 Spring 2018 Unit 2. Regression and Correlation (Part 1 of 2) STATA Users

Hypothesis testing: Steps

PubHlth 540 Estimation Page 1 of 69. Unit 6 Estimation

Review of Multiple Regression

Statistical Inference for Means

Chapter 6. The Standard Deviation as a Ruler and the Normal Model 1 /67

1) Answer the following questions with one or two short sentences.

Exercise 1. Exercise 2. Lesson 2 Theoretical Foundations Probabilities Solutions You ip a coin three times.

Contingency Tables. Contingency tables are used when we want to looking at two (or more) factors. Each factor might have two more or levels.

MAT 2379, Introduction to Biostatistics, Sample Calculator Questions 1. MAT 2379, Introduction to Biostatistics

E509A: Principle of Biostatistics. GY Zou

BIOSTATS Intermediate Biostatistics Spring 2017 Exam 2 (Units 3, 4 & 5) Practice Problems SOLUTIONS

Correlation & Simple Regression

BE540W Estimation Page 1 of 72. Topic 6 Estimation

One-sample categorical data: approximate inference

Using SPSS for One Way Analysis of Variance

The Difference in Proportions Test

Descriptive Statistics Class Practice [133 marks]

INTRODUCTION TO ANALYSIS OF VARIANCE

Final Exam - Solutions

Basic Statistics Exercises 66

T test for two Independent Samples. Raja, BSc.N, DCHN, RN Nursing Instructor Acknowledgement: Ms. Saima Hirani June 07, 2016

Inference for the Regression Coefficient

QUEEN S UNIVERSITY FINAL EXAMINATION FACULTY OF ARTS AND SCIENCE DEPARTMENT OF ECONOMICS APRIL 2018

Math 10 - Compilation of Sample Exam Questions + Answers

Section 9.4. Notation. Requirements. Definition. Inferences About Two Means (Matched Pairs) Examples

Simple Linear Regression: One Qualitative IV

16.400/453J Human Factors Engineering. Design of Experiments II

Examples of frequentist probability include games of chance, sample surveys, and randomized experiments. We will focus on frequentist probability sinc

CHAPTER 10. Regression and Correlation

Chapter 23: Inferences About Means

Open book, but no loose leaf notes and no electronic devices. Points (out of 200) are in parentheses. Put all answers on the paper provided to you.

Elementary Statistics

Introduction to Basic Statistics Version 2

Unit 2 Regression and Correlation WEEK 3 - Practice Problems. Due: Wednesday February 10, 2016

Inference with Simple Regression

PubHlth 540 Fall Summarizing Data Page 1 of 18. Unit 1 - Summarizing Data Practice Problems. Solutions

MATH4427 Notebook 4 Fall Semester 2017/2018

Stat 135, Fall 2006 A. Adhikari HOMEWORK 6 SOLUTIONS

Population 1 Population 2

Binary Logistic Regression

Unit 1 Summarizing Data

In order to carry out a study on employees wages, a company collects information from its 500 employees 1 as follows:

STAT 135 Lab 6 Duality of Hypothesis Testing and Confidence Intervals, GLRT, Pearson χ 2 Tests and Q-Q plots. March 8, 2015

Outline for Today. Review of In-class Exercise Bivariate Hypothesis Test 2: Difference of Means Bivariate Hypothesis Testing 3: Correla

Exam 2 (KEY) July 20, 2009

Review of Statistics 101

The Chi-Square Distributions

Ordinary Least Squares Regression Explained: Vartanian

Chapter 6. Estimates and Sample Sizes

Essential Question: What are the standard intervals for a normal distribution? How are these intervals used to solve problems?

The t-statistic. Student s t Test

BIOL 51A - Biostatistics 1 1. Lecture 1: Intro to Biostatistics. Smoking: hazardous? FEV (l) Smoke

PHP2510: Principles of Biostatistics & Data Analysis. Lecture X: Hypothesis testing. PHP 2510 Lec 10: Hypothesis testing 1

An Analysis of College Algebra Exam Scores December 14, James D Jones Math Section 01

Keppel, G. & Wickens, T. D. Design and Analysis Chapter 4: Analytical Comparisons Among Treatment Means

Math 58. Rumbos Fall More Review Problems Solutions

Preliminary Statistics course. Lecture 1: Descriptive Statistics

Unit 1 Summarizing Data

Transcription:

BIOSTATS 640 Spring 2017 Review of Introductory Biostatistics STATA solutions Page 1 of 16 Unit 1 Review of BIOSTATS 540 Practice Problems SOLUTIONS - Stata Users #1. The following table lists length of stay in hospital (days) for a sample of 25 patients. 5 10 6 11 5 14 30 11 17 3 9 3 8 8 5 5 7 4 3 7 9 11 11 9 4 By any means you like (e.g. StatKey, R or Stata, construct a frequency/relative frequency table for these data using 5-day class intervals. Include columns for the frequency counts, relative frequencies, and cumulative frequencies. Stata Solution (comments begin with an asterisk and output is shown in blue) Tip For this exercise, the easiest way to get the data into Stata is to first put it into an Excel file and then do a copy and paste. Be sure that in creating your excel file that you format the column containing your data as numeric.. *** Initialize variable prior to pasting in data from EXCEL. generate los=.. *(1 variable, 25 observations pasted into data editor). *** At this point, I clicked on the icon for DATA EDITOR and completed the paste. *** I then closed the data editor window so as to return to the command line.. *** From the toolbar at top, do a FILE > SAVE to save the data. Stata then shows:. save "/Users/cbigelow/Desktop/BIOSTATS 640 hw01 practice 1.dta" file /Users/cbigelow/Desktop/BIOSTATS 640 hw01 practice 1.dta saved. *** Obtain min and max for using in setting 5-day class intervals. tabstat los, statistics(min max) variable min max -------------+-------------------- los 3 30 ----------------------------------. *** Create grouped measure of los with 5-day class intervals. generate losgroup=los. recode losgroup (min/5=1) (6/10=2) (11/15=3) (16/20=4) (21/25=5) (26/max=6) (losgroup: 25 changes made). label define groupf 1 "1-5 days" 2 "6-10 days" 3 "11-15 days" 4 "16-20 days" 5 "21-25 days" 6 "26-30 days". label values losgroup groupf

BIOSTATS 640 Spring 2017 Review of Introductory Biostatistics STATA solutions Page 2 of 16. *** Frequency/relative frequency table. tabulate losgroup losgroup Freq. Percent Cum. ------------+----------------------------------- 1-5 days 9 36.00 36.00 6-10 days 9 36.00 72.00 11-15 days 5 20.00 92.00 16-20 days 1 4.00 96.00 26-30 days 1 4.00 100.00 ------------+----------------------------------- Total 25 100.00. save "/Users/cbigelow/Desktop/BIOSTATS 640 hw01 practice 1.dta", replace file /Users/cbigelow/Desktop/BIOSTATS 640 hw01 practice 1.dta saved #2. The following table lists fasting cholesterol levels (mg/dl) for two groups of men. Group 1: 233 291 312 250 246 197 268 224 239 239 254 276 234 181 248 252 202 218 212 325 Group 2: 344 185 263 246 224 212 188 250 148 169 226 175 242 252 153 183 137 202 194 213 By any means you like (e.g. Statkey, R or Stata, construct the following graphical comparisons of the two groups: 2a. Side-by-side box plot 2b. Side-by-side histograms with same definitions (starting value, ending value, tick marks, etc) of the horizontal and vertical axes. In 1-2 sentences, compare the two distributions. What conclusions do you draw?

BIOSTATS 640 Spring 2017 Review of Introductory Biostatistics STATA solutions Page 3 of 16 Stata Solution (notice that I issued the command clear to clear the working memory before proceeding) Tip Play with this. I played with a number of graphs before settling on what I wanted as my final answer.. clear. set scheme lean1. *** Initialize variables prior to pasting in data from EXCEL. generate group=.. generate ychol=.. *(2 variables, 40 observations pasted into data editor). save "/Users/cbigelow/Desktop/BIOSTATS 640 hw01 practice 2.dta" file /Users/cbigelow/Desktop/BIOSTATS 640 hw01 practice 2.dta saved. *** 2a) Side by side box plot. sort group. graph box ychol, over(group) title("fasting Cholesterol (mg/dl)") ytitle("mg/dl") ylabel(100(50)400). *** 2b) Side by side histogram. histogram ychol if group==1, bin(5) start(135) fraction addlabels title("fasting Cholesterol") subtitle("group 1") name(group1, replace) (bin=5, start=135, width=38). histogram ychol if group==2, bin(5) start(135) fraction addlabels title("fasting Cholesterol") subtitle("group 2") name(group2, replace) (bin=5, start=135, width=41.8). graph combine group1 group2. save "/Users/cbigelow/Desktop/BIOSTATS 640 hw01 practice 2.dta", replace file /Users/cbigelow/Desktop/BIOSTATS 640 hw01 practice 2.dta saved Side by side Box Plot Side by side Histogram Comparison of the distributions: Men in Group 1 tend to have higher fasting cholesterol values, as reflected in the upward shift in location of data points. The variation in fasting cholesterol is slightly greater for men in Group 2; this is most easily seen in the side-by-side box plot which shows a larger interquartile range (the size of the box itself) and a more distant outlier.

BIOSTATS 640 Spring 2017 Review of Introductory Biostatistics STATA solutions Page 4 of 16 #3. Consider the following setting. Seventy-nine firefighters were exposed to burning polyvinyl chloride (PVC) in a warehouse fire in Plainfield, New Jersey on March 20, 1985. A study was conducted in an attempt to determine whether or not there were short- and long-term respiratory effects of the PVC. At the long term follow-up visit at 22 months after the exposure, 64 firefighters who had been exposed during the fire and 22 firefighters who where not exposed reported on the presence of various respiratory conditions. Eleven of the PVC exposed firefighters had moderate to severe shortness of breath compared to only 1 of the non-exposed firefighters. Calculate the probability of finding 11 or more of the 64 exposed firefighters reporting moderate to severe shortness of breath if the rate of moderate to severe shortness of breath is 1 case per 22 persons. Show your work. Answer: 0.000136 This is a binomial distribution calculation with n=64 π=(1/22) = 0.04545 and x > 11. The solution is then 64 64! x Probability[X 11] = 0.04545 1-0.04545 x=11 x! ( 64-x )! I then used the VassarStats calculator (see footer) ( ) 64-x http://vassarstats.net/binomialx.html

BIOSTATS 640 Spring 2017 Review of Introductory Biostatistics STATA solutions Page 5 of 16 #4. The Air Force uses ACES-II ejection seats that are designed for men who weigh between 140 lb and 211 lb. Suppose it is known that women s weights are distributed Normal with mean 143 lb and standard deviation 29 lb. 4a. What proportion of women have weights that are outside the ACES-II ejection seat acceptable range? Answer: 46.8% http://davidmlane.com/hyperstat/z_table.html

BIOSTATS 640 Spring 2017 Review of Introductory Biostatistics STATA solutions Page 6 of 16 4b. In a sample of 1000 women, how many are expected to have weights below the 140 lb threshold? Answer: 459 since 459 = 1000 women x (.4588 likelihood of weight below 140 lb) http://davidmlane.com/hyperstat/z_table.html

BIOSTATS 640 Spring 2017 Review of Introductory Biostatistics STATA solutions Page 7 of 16 #5. Consider the setting of a single sample of n=16 data values that are a random sample from a normal distribution. Suppose it is of interest to perform a type I error α = 0.01 statistical hypothesis test of H O : µ > 100 versus H A : µ < 100, α = 0.01. Suppose further that σ is unknown. 5a. State the appropriate test statistic Answer: Student t with degrees of freedom = 15. This is a one sample setting of normally distributed data where the population variance is not known and interest is in the mean. Because the sample size is 16, the degrees of freedom is (16-1)=15. 5b. Determine the critical region for values of the sample mean X. Answer: X ( S/4)( -2.602 ) + 100 Given: (1) Solution for n=16, α=0.01,one sided(left), µ O=100 SE ˆ = S S 16 = 4 t = t = -2.602 (2) Solution for CRITICAL.01;df=15 (3) Solution for X CRITICAL is obtained by its solution in the following expression: t t OBSERVED NULL CRITICAL X - µ NULL SEˆ -2.602 X - µ (SE)(-2.602) ˆ X (SE)(-2.602) ˆ + µ X (S/4)(-2.602) + µ ( )( ) NULL NULL X S/4-2.602 + 100

BIOSTATS 640 Spring 2017 Review of Introductory Biostatistics STATA solutions Page 8 of 16 #6. An investigator is interested in the mean cholesterol level µ of patients with myocardial infarction. S/he drew a simple random sample of n=50 patients and from these data constructed a 95% confidence interval for the mean µ. In these calculations, it was assumed that the data are a simple random sample from a normal distribution with known variance. The resulting width of the confidence interval was 10 mg/dl. How large a sample size would have been required if the investigator wished to obtain a confidence interval width equal to 5 mg/dl? Answer: 200 (1) Solution for value of confidence coefficient: 95% CI and σ known à Desired confidence coefficient is 97.5 th percentile of Normal(0,1) = 1.96 (2) Expression for CI width: width = (upper limit) - (lower limit) 1.96σ 1.96σ = X + - X - n n 1.96σ 1.96σ = - - n n (2)(1.96)σ = n (3) Using known width=10 and known n=50, obtain σ = 18.0384 (2)(1.96)σ 10 = n (10)( n ) = σ (2)(1.96) (10)( 50) = σ (2)(1.96) σ = 18.0384

BIOSTATS 640 Spring 2017 Review of Introductory Biostatistics STATA solutions Page 9 of 16 (3) Using known width=5 and σ = 18.0384 known, obtain n=200 (2)(1.96)σ 5 = n (2)(1.96)σ n = 5 (2)(1.96)(18.0384) n = 5 n = 14.1421 n = 199.99 Or 200, by rounding up. #7. In (a) (d) below, you may assume that the data are a simple random sample (or samples) from a normal distribution (or distributions). Each setting is a different setting of confidence interval estimation. In each, state the values of the confidence coefficients (recall these will be the values of specific percentiles from the appropriate probability distribution). 7a. For a single sample size of n=15 and the estimation of the population mean µ when the variance is unknown using a 90% confidence interval, what are the values of the confidence coefficients? Answer: -1.761 and + 1.761 For 90% confidence, want the 5 th and 95 th percentile of Student t w df=14.

BIOSTATS 640 Spring 2017 Review of Introductory Biostatistics STATA solutions Page 10 of 16 7b. For a single sample size n=35 and the estimation of a variance parameter what are the values of the confidence coefficients? 2 σ using a 95% confidence interval, Answer: 19.81 and 51.97 For 95% confidence, want the 2.5 th and 97.5 th percentile of chi square w df=34 7c. For a single sample size of n=25 and the estimation of the population mean µ when the variance is known using a 80% confidence interval, what are the values of the confidence coefficients? Answer: -1.282 and + 1.282 For 80% confidence, want the 10 th and 90 th percentile of Normal(0,1)

BIOSTATS 640 Spring 2017 Review of Introductory Biostatistics STATA solutions Page 11 of 16 7d. For the setting of two independent samples, one with sample size n 1 = 13 and the other with sample size n 2 = 22, it is of interest to construct a 90% confidence interval estimate of the ratio of the two population variances, [ σ /σ ]. What are the values of the confidence coefficients? 2 2 1 2 Answer: 0.44 and 2.56 For 90% confidence, want the 5 th and 95 th percentiles of F distribution with numerator df =(13-1) = 12 and denominator df= (22-1) = 21. See below. These values are: 5 th percentile = 0.39 95 th percentile = 2.25 Solution for Lower Confidence Coefficient Multiplier = [ 1 / 95 th percentile ] = [ 1/2.25 ] = 0.44 Solution for Upper Confidence Coefficient Multiplier = [ 1 / 5 th percentile ] = [ 1/0.39 ] = 2.56 Solution for F-Distribution Percentiles Using Stata function invftail - it works w RIGHT tail areas Solution for F-Distribution Percentiles Using Daniel Soper F-Distribution Calculator - it works with RIGHT tail areas

BIOSTATS 640 Spring 2017 Review of Introductory Biostatistics STATA solutions Page 12 of 16 #8. A study was investigated of length of hospital stay associated with seat belt use among children hospitalized following motor vehicle crashes. The following are the observed sample mean and sample standard deviations for two groups of children: 290 children who were not wearing a seat belt at the time of the accident plus 123 children who were wearing a seat belt at the time of the accident. Group Sample size, n Sample mean Sample standard deviation Seat belt = no n NO = 290 X NO = 1.39 days S NO = 3.06 days Seat belt = yes n YES = 123 X YES = 0.83 days YES S = 2.77 days You may assume normality. You may also assume that the unknown variances are equal. Construct a 95% confidence interval estimate of the difference between the two population means. In developing your answer, you may assume that the population variances are unknown but EQUAL. 8a. What is the value of the point estimate? Answer: 0.56 [ ] [ ] Point estimate of µ 1 - µ 2 = X 1 - X 2 = 1.39-0.83 = 0.56 8b. What is the value of the estimated standard error of the point estimate? Answer: 0.32 (1) Preliminary: Obtain 2 S pool 2 2 2 1 1 2 2 pool 1 2 ( ) (n -1)S + (n -1)S (289) 9.3636 + (122)(7.6729) S = = = 8.8617377 (n -1) + (n -1) (289) + (122) (2) Solution for SEˆ X 1 - X2

BIOSTATS 640 Spring 2017 Review of Introductory Biostatistics STATA solutions Page 13 of 16 2 2 Spool Spool 8.8617377 8.8617377 SEˆ X 1 - X 2 = + = + = 0.320319 n n 290 123 1 2 8c. What is the value of the confidence coefficient? Answer: 1.966 1.96 is fine, too. For 95% confidence, want the 2.5 th and 97.5 th percentile of Student t w df=411. Note With df=411 being so large, the Normal(0,1) could be used, too.

BIOSTATS 640 Spring 2017 Review of Introductory Biostatistics STATA solutions Page 14 of 16 8d. What are values of the lower and upper limits of the confidence interval? Answer: [ -0.07, + 1.19 ] ( ) ( ) CI = [X -X ] ± t SE(X ˆ -X ) 1 2.975;df=411 1 2 = 0.56 ± 1.966 (0.32) = [ -0.06912, +1.18912 ] 8e. Write a clear interpretation of the confidence interval. With 95% confidence, from these data, it is estimated that the difference in average length of stay (non-seat belt wearers minus seat belt wearers) is between -0.07 days and +1.19 days. Since this interval includes 0, these data do not provide statistically significant evidence that the length of hospital stay for children in motor vehicle crashes who were not wearing seat belts is different than the length of hospital stay for children in motor vehicle crashes who were wearing seat belts. #9. A test consists of multiple choice questions, each having four possible answers, one of which is correct. What is the probability of getting exactly four correct answers when six guesses are made?. Answer:.03 This is a binomial probability calculation. N=6 π=.24 Want Pr[ X = 4] n 6 x 4 x n-x 4 2 Pr[X=4] = π (1-π) =.25 (.75) =.032959

BIOSTATS 640 Spring 2017 Review of Introductory Biostatistics STATA solutions Page 15 of 16 #10. After being rejected for employment, woman A learns that company X has hired only 2 women among the last 20 new employees. She also learns that the pool of applicants is very large, with an approximately equal number of qualified men and women. Help her address the charge of gender discrimination by finding the probability of getting 2 or fewer women when 20 people hired under the assumption that there is no discrimination based on gender. Does the resulting probability really support such a charge? Answer:.0002 This is a very small probability. As such, it would support a charge of gender discrimination but only under the circumstances where, for each position filled, there were an equal number of men and women applicants. This is also a binomial probability calculation. Here, n=20 π=.50 Want Pr[ X < 2] 2 2 n x n-x 20 x 20-x Pr[X 2] = π (1-π) =.50 (.50) =.00020122 x=0 x x=0 x #11. Suppose the length of newborn infants is distributed normal with mean 52.5 cm and standard deviation 4.5 cm. What is the probability that the mean of a sample of size 15 is greater than 56 cm? Answer:.001 This is a normal distribution probability calculation. σ 4.5 X n=15 is distributed Normal with µ =52.5 and se(x X n=15) = = = 1.1619 15 15 X n=15 - µ X 56-52.5 Want Pr[X n=15 56] = Pr = Pr[ Z-score 3.0123 ] =.001296 se( Xn=15 ) 1.1619

BIOSTATS 640 Spring 2017 Review of Introductory Biostatistics STATA solutions Page 16 of 16 #12. Suppose that 25 year old males have a remaining life expectancy of an additional 55 years with a standard deviation of 6 years. Suppose further that this distribution of additional years life is normal. What proportion of 25 year-old males will live past 65 years of age? Answer:.99 This is also a normal distribution probability calculation. X is distributed Normal with µ = 55 and σ = 6 Living past 65 years of age corresponds to a remaining life expectancy of an additional 40+ years. Thus, want: X - µ 40-55 Pr[X 40] = Pr = Pr[ Z-score -2.5] =.99379 σ 6