Booklet of Code and Output for STAC32 Final Exam December 8, 2014
List of Figures

1 Popcorn data
2 MDs by city, with normal quantile plot
3 Reading in the MDs data
4 Output from proc univariate for MDs data
5 Apnea data and differences (before minus after)
6 Table of binomial distribution with n = 13, p = 0.5
7 Data for selecting in SAS
8 SAS code for data and means for the writers data
9 Boxplots for writers data
10 SAS ANOVA for writers data
11 GPA data
12 GPA data: first regression
13 GPA data: regression without SATM
14 GPA data: regression with only HSGPA and SATV
15 Mystery R code
16 Scatterplot of social worker salaries by experience
17 Regression and residual plot for predicting salary from experience
18 Code and output for Box-Cox transformation of salary
19 Residual plot from regression of transformed salary
20 SAS code for reading and summarizing perch data
21 Obtaining leverages for perch data
Brand    Trial  Unpopped
Orville  1      26
Orville  2      35
Orville  3      18
Orville  4      14
Orville  5      8
Orville  6      6
Seaway   1      47
Seaway   2      47
Seaway   3      14
Seaway   4      34
Seaway   5      21
Seaway   6      37

Figure 1: Popcorn data
R> health=read.table("metrohealth.txt",header=TRUE)
R> attach(health)
R> head(NumMDs)
[1]  349 4042  256 2679  502 2352
R> qqnorm(NumMDs)
R> qqline(NumMDs)

[Plot: Normal Q-Q Plot; horizontal axis: Theoretical Quantiles (-2 to 2); vertical axis: Sample Quantiles (0 to 8000)]

Figure 2: MDs by city, with normal quantile plot
Some of the health care data. Values are separated by tabs. Actual lines are very long and begin with a city name (the lines wrap here). SAS code is below.

City NumMDs RateMDs NumHospitals NumBeds RateBeds NumMedicare PctChangeMedicare MedicareRate SSBNum SSBRate SSBChange NumRetired SSINum SSIRate SqrtMDs
"Holland-Grand Haven, MI" 349 140 3 316 127 29533 8.3 11835 34135 13679 8.1 23165 2070 820 18.6815
"Louisville, KY-IN" 4042 340 18 3909 328 173845 3 14606 202485 17013 3 118920 29017 2416 63.5767
"Battle Creek, MI" 256 184 3 517 372 22972 2.4 16539 27245 19615 3.3 16645 4095 2945 16

data health;
  infile '/home/ken/metrohealth.txt' firstobs=2 dlm='09'x;
  input city $ nummds;

Figure 3: Reading in the MDs data
proc univariate;
  var nummds;

The UNIVARIATE Procedure
Variable: nummds

Moments
N                 83           Sum Weights        83
Mean              1643.3253    Sum Observations   136396
Std Deviation     1981.43175   Variance           3926071.78
Skewness          2.02744075   Kurtosis           3.92289647
Uncorrected SS    546080884    Corrected SS       321937886
Coeff Variation   120.57453    Std Error Mean     217.49039

Basic Statistical Measures
Location               Variability
Mean     1643.325      Std Deviation          1981
Median   844.000       Variance               3926072
Mode     200.000       Range                  9267
                       Interquartile Range    1685

Note: The mode displayed is the smallest of 7 modes with a count of 2.

Tests for Location: Mu0=0
Test           -Statistic-      -----p Value------
Student's t    t   7.555852     Pr > |t|    <.0001
Sign           M   41.5         Pr >= |M|   <.0001
Signed Rank    S   1743         Pr >= |S|   <.0001

Quantiles (Definition 5)
Quantile      Estimate
100% Max      9410
99%           9410
95%           6050
90%           4612
75% Q3        2018
50% Median    844
25% Q1        333
10%           226
5%            200
1%            143
0% Min        143

Extreme Observations
----Lowest----    ----Highest---
Value    Obs      Value    Obs
143      23       6050     61
144      43       6180     39
180      51       7575     81
185      55       8107     67
200      37       9410     35

Figure 4: Output from proc univariate for MDs data
R> apnea=read.table("apnea.txt",header=TRUE)
R> attach(apnea)
R> diff=before-after
R> cbind(apnea,diff)
   before after  diff
1    1.71  0.13  1.58
2    1.25  0.88  0.37
3    2.13  1.38  0.75
4    1.29  0.13  1.16
5    1.58  0.25  1.33
6    4.00  2.63  1.37
7    1.42  1.38  0.04
8    1.08  0.50  0.58
9    1.83  1.25  0.58
10   0.67  0.75 -0.08
11   1.13  0.00  1.13
12   2.71  2.38  0.33
13   1.96  1.13  0.83

Figure 5: Apnea data and differences (before minus after)
The table below shows the probability of obtaining less than or equal to k successes in a binomial distribution with n = 13, p = 0.5.

R> k=0:13
R> p=pbinom(k,13,0.5)
R> cbind(k,p)
       k            p
 [1,]  0 0.0001220703
 [2,]  1 0.0017089844
 [3,]  2 0.0112304688
 [4,]  3 0.0461425781
 [5,]  4 0.1334228516
 [6,]  5 0.2905273438
 [7,]  6 0.5000000000
 [8,]  7 0.7094726562
 [9,]  8 0.8665771484
[10,]  9 0.9538574219
[11,] 10 0.9887695312
[12,] 11 0.9982910156
[13,] 12 0.9998779297
[14,] 13 1.0000000000

Figure 6: Table of binomial distribution with n = 13, p = 0.5
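For reference, each entry in the table of Figure 6 is a cumulative binomial probability. The block below is a sketch of the standard formula behind pbinom(k,13,0.5), with the first two rows checked by hand. The symbol X (the number of successes) is a label introduced here and does not appear in the booklet.

```latex
P(X \le k) = \sum_{i=0}^{k} \binom{13}{i} (0.5)^{i} (0.5)^{13-i}
           = \frac{1}{2^{13}} \sum_{i=0}^{k} \binom{13}{i}

P(X \le 0) = \frac{1}{8192} \approx 0.0001220703

P(X \le 1) = \frac{1 + 13}{8192} = \frac{14}{8192} \approx 0.0017089844
```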
data mydata;
  infile '/home/ken/mydata.txt';
  input x y g $;
proc print;

Obs        x    y    g
  1  32.3020    2    a
  2  30.8283    6    a
  3  29.0993    6    a
  4  24.4495    2    b
  5  30.3253    3    b
  6  24.1334    1    a
  7  23.6774    6    c
  8  32.6610    9    b
  9  27.5017    5    b
 10  17.2036    6    b

Figure 7: Data for selecting in SAS

data writers;
  infile 'writers.txt';
  input genre $ age;
proc means;
  var age;
  class genre;

The MEANS Procedure
Analysis Variable : age

genre      N Obs   N    Mean         Std Dev      Minimum      Maximum
--------------------------------------------------------------------------
nonficti   24      24   76.8750000   14.0969084   40.0000000   97.0000000
novelist   67      67   71.4477612   13.0515105   35.0000000   91.0000000
poet       32      32   63.1875000   17.2970956   30.0000000   90.0000000
--------------------------------------------------------------------------

Figure 8: SAS code for data and means for the writers data
proc boxplot;
  plot age*genre / boxstyle=schematic;

[Plot: schematic boxplots of age (vertical axis, 20 to 100) by genre (horizontal axis: novelist, poet, nonficti)]

Figure 9: Boxplots for writers data
proc anova;
  class genre;
  model age=genre;
  means genre / tukey;

The ANOVA Procedure

Class Level Information
Class   Levels   Values
genre   3        nonficti novelist poet

Number of Observations Read   123
Number of Observations Used   123

The ANOVA Procedure
Dependent Variable: age

Source            DF    Sum of Squares   Mean Square   F Value   Pr > F
Model             2     2744.19300       1372.09650    6.56      0.0020
Error             120   25088.06716      209.06723
Corrected Total   122   27832.26016

R-Square   Coeff Var   Root MSE   age Mean
0.098598   20.55092    14.45916   70.35772

Source   DF   Anova SS      Mean Square   F Value   Pr > F
genre    2    2744.192998   1372.096499   6.56      0.0020

The ANOVA Procedure
Tukey's Studentized Range (HSD) Test for age

NOTE: This test controls the Type I experimentwise error rate.

Alpha                                  0.05
Error Degrees of Freedom               120
Error Mean Square                      209.0672
Critical Value of Studentized Range    3.35614

Comparisons significant at the 0.05 level are indicated by ***.

genre                 Difference        Simultaneous 95%
Comparison            Between Means     Confidence Limits
nonficti - novelist     5.427            -2.736    13.590
nonficti - poet        13.688             4.422    22.953   ***
novelist - nonficti    -5.427           -13.590     2.736
novelist - poet         8.260             0.887    15.634   ***
poet - nonficti       -13.688           -22.953    -4.422   ***
poet - novelist        -8.260           -15.634    -0.887   ***

Figure 10: SAS ANOVA for writers data
R> gpa=read.table("gpa.txt",header=TRUE)
R> head(gpa)
   GPA HSGPA SATV SATM Male   HU   SS FirstGen White CollegeBound
1 3.06  3.83  680  770    1  3.0  9.0        1     1            1
2 4.15  4.00  740  720    0  9.0  3.0        0     1            1
3 3.41  3.70  640  570    0 16.0 13.0        0     0            1
4 3.21  3.51  740  700    0 22.0  0.0        0     1            1
5 3.48  3.83  610  610    0 30.5  1.5        0     1            1
6 2.95  3.25  600  570    0 18.0  3.0        0     1            1

Figure 11: GPA data

R> gpa.1=lm(GPA~HSGPA+SATV+SATM+Male,data=gpa)
R> summary(gpa.1)

Call:
lm(formula = GPA ~ HSGPA + SATV + SATM + Male, data = gpa)

Residuals:
     Min       1Q   Median       3Q      Max
-0.95975 -0.27713  0.05058  0.28319  0.89525

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.135e-01  3.283e-01   1.869  0.06301 .
HSGPA        5.069e-01  7.623e-02   6.650  2.4e-10 ***
SATV         1.174e-03  3.940e-04   2.979  0.00322 **
SATM        -5.580e-06  4.626e-04  -0.012  0.99039
Male         5.534e-02  6.020e-02   0.919  0.35901
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.407 on 214 degrees of freedom
Multiple R-squared: 0.2494, Adjusted R-squared: 0.2353
F-statistic: 17.77 on 4 and 214 DF, p-value: 1.298e-12

Figure 12: GPA data: first regression
R> gpa.2=update(gpa.1,.~.-SATM)
R> summary(gpa.2)

Call:
lm(formula = GPA ~ HSGPA + SATV + Male, data = gpa)

Residuals:
     Min       1Q   Median       3Q      Max
-0.95990 -0.27695  0.05086  0.28309  0.89534

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.6117703  0.2964803   2.063 0.040272 *
HSGPA       0.5068283  0.0756565   6.699 1.81e-10 ***
SATV        0.0011714  0.0003423   3.422 0.000743 ***
Male        0.0550773  0.0560430   0.983 0.326827
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4061 on 215 degrees of freedom
Multiple R-squared: 0.2494, Adjusted R-squared: 0.2389
F-statistic: 23.81 on 3 and 215 DF, p-value: 2.414e-13

Figure 13: GPA data: regression without SATM
R> gpa.3=update(gpa.2,.~.-Male)
R> summary(gpa.3)

Call:
lm(formula = GPA ~ HSGPA + SATV, data = gpa)

Residuals:
     Min       1Q   Median       3Q      Max
-0.97894 -0.27639  0.02867  0.30133  0.87956

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.6351217  0.2955033   2.149  0.03272 *
HSGPA       0.4975320  0.0750569   6.629 2.66e-10 ***
SATV        0.0012283  0.0003373   3.641  0.00034 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4061 on 216 degrees of freedom
Multiple R-squared: 0.246, Adjusted R-squared: 0.239
F-statistic: 35.23 on 2 and 216 DF, p-value: 5.711e-14

Figure 14: GPA data: regression with only HSGPA and SATV

R> my.df=data.frame(HSGPA=3.6,SATV=640,SATM=670,Male=0)
R> pp=predict(gpa.3,my.df)
R> cbind(my.df,pp)
  HSGPA SATV SATM Male       pp
1   3.6  640  670    0 3.212337

Figure 15: Mystery R code
R> socwork=read.table("socwork.txt",header=TRUE)
R> attach(socwork)
R> plot(salary~experience)
R> lines(lowess(salary~experience))

[Plot: scatterplot with lowess curve; horizontal axis: experience (0 to 25); vertical axis: salary (2e+04 to 1e+05)]

Figure 16: Scatterplot of social worker salaries by experience
R> esq=experience*experience
R> socwork.1=lm(salary~experience+esq)
R> r=resid(socwork.1)
R> f=fitted(socwork.1)
R> plot(r~f)

[Plot: residual plot; horizontal axis: f (20000 to 70000); vertical axis: r (-10000 to 20000)]

Figure 17: Regression and residual plot for predicting salary from experience
R> library(MASS)
R> boxcox(salary~experience)

[Plot: Box-Cox log-Likelihood against λ (horizontal axis, -2 to 2), with the 95% level marked]

Figure 18: Code and output for Box-Cox transformation of salary
R> socwork.2=lm(sal.trans~experience)
R> r=resid(socwork.2)
R> f=fitted(socwork.2)
R> plot(r~f)
R> lines(lowess(r~f))

[Plot: residual plot with lowess curve; horizontal axis: f (10.0 to 11.2); vertical axis: r (-0.3 to 0.2)]

Figure 19: Residual plot from regression of transformed salary
data perch;
  infile '/home/ken/perch.txt' firstobs=2 expandtabs;
  input obs weight length width;
  z=1;
proc print;
proc means;
  var weight length width;

Obs   obs   weight   length   width   z
  1   104      5.9      8.8     1.4   1
  2   105     32.0     14.7     2.0   1
  3   106     40.0     16.0     2.4   1
  4   107     51.5     17.2     2.6   1
  5   108     70.0     18.5     2.9   1
  6   109    100.0     19.2     3.3   1
  7   110     78.0     19.4     3.1   1
  8   111     80.0     20.2     3.1   1
  9   112     85.0     20.8     3.0   1
 10   113     85.0     21.0     2.8   1
 11   114    110.0     22.5     3.6   1
 12   115    115.0     22.5     3.3   1
 13   116    125.0     22.5     3.7   1
 14   117    130.0     22.8     3.5   1
 15   118    120.0     23.5     3.4   1
 16   119    120.0     23.5     3.5   1
 17   120    130.0     23.5     3.5   1
 18   121    135.0     23.5     3.5   1
 19   122    110.0     23.5     4.0   1
 20   123    130.0     24.0     3.6   1
 21   124    150.0     24.0     3.6   1
 22   125    145.0     24.2     3.6   1
 23   126    150.0     24.5     3.6   1
 24   127    170.0     25.0     3.7   1
 25   128    225.0     25.5     3.7   1
 26   129    145.0     25.5     3.8   1
 27   130    188.0     26.2     4.2   1
 28   131    180.0     26.5     3.7   1
 29   132    197.0     27.0     4.2   1
 30   133    218.0     28.0     4.1   1

The MEANS Procedure

Variable   N    Mean          Std Dev      Minimum     Maximum
-----------------------------------------------------------------
weight     30   120.6800000   53.0654079   5.9000000   225.0000000
length     30    22.1333333    4.0797510   8.8000000    28.0000000
width      30     3.3466667    0.6262826   1.4000000     4.2000000
-----------------------------------------------------------------

Figure 20: SAS code for reading and summarizing perch data
proc reg;
  model z=weight length width / influence;

The REG Procedure
Model: MODEL1
Dependent Variable: z

Number of Observations Read   30
Number of Observations Used   30

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             3    0                0             .         .
Error             26   0                0
Corrected Total   29   0

Root MSE         0         R-Square   .
Dependent Mean   1.00000   Adj R-Sq   .
Coeff Var        0

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   1    1.00000              0                Infty     <.0001
weight      1    0                    0                .         .
length      1    0                    0                .         .
width       1    0                    0                .         .

The REG Procedure
Model: MODEL1
Dependent Variable: z

Output Statistics
                               Hat Diag   Cov              ---------------DFBETAS--------------
Obs   Residual   RStudent      H          Ratio   DFFITS   Intercept   weight   length   width
  1   0          .             0.5818     .       .        .           .        .        .
  2   0          .             0.2134     .       .        .           .        .        .
  3   0          .             0.1177     .       .        .           .        .        .
  4   0          .             0.0926     .       .        .           .        .        .
  5   0          .             0.0719     .       .        .           .        .        .
  6   0          .             0.2173     .       .        .           .        .        .
  7   0          .             0.0799     .       .        .           .        .        .
  8   0          .             0.0681     .       .        .           .        .        .
  9   0          .             0.0939     .       .        .           .        .        .
 10   0          .             0.2246     .       .        .           .        .        .
 11   0          .             0.0915     .       .        .           .        .        .
 12   0          .             0.0528     .       .        .           .        .        .
 13   0          .             0.1226     .       .        .           .        .        .
 14   0          .             0.0375     .       .        .           .        .        .
 15   0          .             0.0848     .       .        .           .        .        .
 16   0          .             0.0649     .       .        .           .        .        .
 17   0          .             0.0439     .       .        .           .        .        .
 18   0          .             0.0398     .       .        .           .        .        .
 19   0          .             0.2990     .       .        .           .        .        .
 20   0          .             0.0559     .       .        .           .        .        .
 21   0          .             0.0449     .       .        .           .        .        .
 22   0          .             0.0447     .       .        .           .        .        .
 23   0          .             0.0536     .       .        .           .        .        .
 24   0          .             0.0732     .       .        .           .        .        .
 25   0          .             0.4217     .       .        .           .        .        .
 26   0          .             0.0812     .       .        .           .        .        .
 27   0          .             0.1640     .       .        .           .        .        .
 28   0          .             0.1574     .       .        .           .        .        .
 29   0          .             0.1298     .       .        .           .        .        .
 30   0          .             0.1758     .       .        .           .        .        .

Sum of Residuals                 0
Sum of Squared Residuals         0
Predicted Residual SS (PRESS)    0

Figure 21: Obtaining leverages for perch data