USING ODS WITH PROC UNIVARIATE

Rebecca G. Frederick
Louisiana State University
Department of Experimental Statistics
South Central SAS Users Group

ABSTRACT

Proc Univariate is used by many statisticians to get a handle on the type of data that is to be analyzed. Regardless of whether I am conducting a Regression, a General Linear Model, or a Survival Analysis, I have to investigate the data. Outliers are a natural occurrence in many datasets. Analysis with Proc Univariate should occur first on the data, before any other procedure. Hypertext Markup Language (HTML) output produced with the ODS statement permits the results to be easily shared with collaborators.
INTRODUCTION

Typically, a statistician will be handed a data set with a large number of observations supplied by a client. The client has assured the statistician that all of the observations have been checked and re-checked for outliers and that all problems have been eliminated. But no data set has been cleaned up until it has been analyzed through SAS Proc Univariate. Further, suppose that the client is at another location, so all information will be emailed between the client and the statistician. This is where the SAS ODS statements come in handy.

INTRODUCTION (CONTINUED)

A quantile is the value below which a specified fraction of the total unit area under a density curve lies. For instance, the pth percentile is the value such that the area to the left of it is p percent of the total area.

    Quantile                 Estimate
    100% (Max)                     94
    95%                            94
    75% (Third Quartile)           79
    50% (Median)                   77
    25% (First Quartile)           58
    0% (Min)                       52

The estimates are from the dataset Students and are the test scores from a particular section of a class.
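A table like the one above can also be captured as a dataset. The following is a minimal sketch, not the program used for the slides: it assumes the built-in SASHELP.CLASS dataset and its HEIGHT variable as stand-ins for the Students data, and the output dataset and variable names (WORK.QUANTILES, Q0 through Q100) are hypothetical.

```sas
/* Stand-in data: SASHELP.CLASS ships with SAS. The Students */
/* dataset from the slides is not assumed to be available.   */
proc univariate data=sashelp.class noprint;
  var height;
  /* Save selected quantiles to a dataset; the names on the  */
  /* right-hand sides are hypothetical.                      */
  output out=work.quantiles
         min=q0 q1=q25 median=q50 q3=q75 p95=q95 max=q100;
run;

/* Display the one-row dataset of quantile estimates */
proc print data=work.quantiles noobs;
run;
```

Replacing SASHELP.CLASS and HEIGHT with STUDENT.STUDENTS and EXAM should give the quantities shown in the table, assuming that dataset is set up as in the programs later in this presentation.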
ADDITIONS TO THE PROC UNIVARIATE: PROBPLOT & QQPLOTS

If the data distribution matches the theoretical distribution, the points on the plot form a linear pattern, y = x. Thus, you can use a Q-Q plot or a probability plot to determine how well a theoretical distribution models a set of measurements. For example:

[Figure: Graph A, No Linear Relationship; Graph B, Linear Relationship (Y = X). Axes: x and y.]

ADDITIONS TO THE PROC UNIVARIATE: PROBPLOT & QQPLOTS (CONTINUED)

The slope and intercept are visual estimates of the scale and location parameters of the theoretical distribution. Q-Q plots are more convenient than probability plots for graphical estimation of the location and scale parameters because the axis of a Q-Q plot is scaled linearly. On the other hand, probability plots are more convenient for estimating percentiles or probabilities.
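Why the slope and intercept estimate scale and location can be sketched for the normal case. The notation here is mine, not from the slides, and the plotting position (i - 0.375)/(n + 0.25) is my understanding of the PROC UNIVARIATE default; treat it as an assumption and check the SAS documentation.

```latex
% x_{(i)}: i-th ordered observation out of n
% z_i: standard normal quantile at the assumed plotting position
\[
  z_i = \Phi^{-1}\!\left(\frac{i - 0.375}{n + 0.25}\right),
  \qquad i = 1, \dots, n
\]
% If the data come from N(\mu, \sigma^2), the ordered values fall
% close to a straight line in z_i:
\[
  x_{(i)} \approx \mu + \sigma\, z_i
\]
% so the intercept of the line estimates the location \mu
% and the slope estimates the scale \sigma.
```

On a Q-Q plot the horizontal axis is z_i itself, which is why the slope and intercept can be read off directly. A probability plot relabels the same axis with percentages, which is why it is easier to read percentiles from it.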
PROBPLOT

Probplot: Creates a probability plot by using high-resolution graphs, which compare ordered variable values with percentiles of a specified theoretical distribution.

Syntax of the statement:

    PROBPLOT variable(s) / option(s);

Probplot statement options can request a distribution (Beta, Exponential, Gamma, Lognormal, Normal, and Weibull) and each of the distribution parameters (Alpha, Beta, C, Mu, Sigma, Slope, Theta, and Zeta).

PROBPLOT (CONTINUED)

A maximum likelihood estimate of a distribution parameter can be requested by specifying distribution-parameter=EST.

Probplot options can also control the appearance of the distribution reference line and the general plot layout, and can enhance the probability plot or comparative plot.

INSET statement: Places a box or table of summary statistics directly in the high-resolution graphics.
PROBPLOT (CONTINUED)

PROGRAM:

    LIBNAME STUDENT 'C:\MYSASDIR\SESSION7';

    /* PROC SORT requires a BY statement. The original omits it; */
    /* BY EXAM is an assumption.                                 */
    PROC SORT DATA=STUDENT.STUDENTS OUT=SORTED;
      BY EXAM;
    RUN;

    GOPTIONS HTITLE=2 HTEXT=1 FTEXT=SWISSB FTITLE=SWISSB;
    SYMBOL VALUE=STAR;

    /* Normal probability plot with maximum likelihood estimates */
    PROC UNIVARIATE DATA=SORTED NOPRINT;
      VAR EXAM;
      PROBPLOT EXAM / NORMAL(MU=EST SIGMA=EST);
      INSET MEAN STD / HEADER='Normal Parameters'
                       POSITION=(95,5) REFPOINT=BR;
      TITLE1 '100 Obs Sampled from a Normal Distribution';
      TITLE2 'Normal Probability Plot';
    RUN;

PROBPLOT OUTPUT (CONTINUED)

[Figure: output of the PROBPLOT program.]
QQPLOT

QQplot: Creates a quantile-quantile plot by using high-resolution graphs, which compare ordered variable values with quantiles of a specified theoretical distribution.

QQplot statement options can request a distribution (Beta, Exponential, Gamma, Lognormal, Normal, and Weibull) and each of the distribution's parameters (Alpha, Beta, C, Mu, Sigma, Slope, Theta, and Zeta).

A maximum likelihood estimate of a distribution parameter can be requested by specifying distribution-parameter=EST.

QQplot options can also control the appearance of the distribution reference line and the general plot layout, and can enhance the Q-Q plot or comparative plot.

QQPLOT (CONTINUED)

PROGRAM:

    LIBNAME STUDENT 'C:\MYSASDIR\SESSION7';

    /* PROC SORT requires a BY statement. The original omits it; */
    /* BY EXAM is an assumption.                                 */
    PROC SORT DATA=STUDENT.STUDENTS OUT=SORTED;
      BY EXAM;
    RUN;

    GOPTIONS HTITLE=2 HTEXT=1 FTEXT=SWISSB FTITLE=SWISSB;
    SYMBOL VALUE=STAR;

    /* Normal Q-Q plot with maximum likelihood estimates */
    PROC UNIVARIATE DATA=SORTED NOPRINT;
      VAR EXAM;
      QQPLOT EXAM / NORMAL(MU=EST SIGMA=EST);
      INSET MEAN STD / HEADER='Normal Parameters'
                       POSITION=(95,5) REFPOINT=BR;
      TITLE1 '100 Obs Sampled from a Normal Distribution';
      TITLE2 'Normal Quantile-Quantile Plot';
    RUN;
QQPLOT OUTPUT (CONTINUED)

[Figure: output of the QQPLOT program.]

OUTPUT DELIVERY SYSTEM (ODS)

With ODS, the user can combine raw data with one or more table definitions to produce output that is sent to a printer or formatted in Hypertext Markup Language (HTML).
ODS (CONTINUED)

ODS breaks down the procedure output into separate pieces so that the user can print out only sections of the report.

ODS in Version 9 supports many destinations. Four of them are:

    The Output destination produces a SAS dataset.
    The Listing destination produces monospace output, which is formatted like traditional SAS procedure output.
    The HTML destination produces output that is formatted in Hypertext Markup Language.
    The Printer destination produces output that is formatted for high-resolution printers.

ODS (CONTINUED)

PROGRAM:

    LIBNAME STUDENT 'C:\MYSASDIR\SESSION7';

    /* PROC SORT requires a BY statement. The original omits it; */
    /* BY EXAM is an assumption.                                 */
    PROC SORT DATA=STUDENT.STUDENTS OUT=SORTED;
      BY EXAM;
    RUN;

    GOPTIONS HTITLE=2 HTEXT=1 FTEXT=SWISSB FTITLE=SWISSB;
    SYMBOL VALUE=STAR;

    /* Open the HTML destination: body, contents, page, and frame files */
    ODS HTML FILE='ODSHTML_BODY.HTM'
             CONTENTS='ODSHTML_CONTENTS.HTM'
             PAGE='ODSHTML_PAGE.HTM'
             FRAME='ODSHTML_FRAME.HTM';
ODS (CONTINUED)

PROGRAM (CONTINUED):

    /* Output from this step is routed to the open HTML destination */
    PROC UNIVARIATE DATA=SORTED NOPRINT;
      VAR EXAM;
      QQPLOT EXAM / NORMAL(MU=EST SIGMA=EST);
      INSET MEAN STD / HEADER='Normal Parameters'
                       POSITION=(95,5) REFPOINT=BR;
      TITLE1 '100 Obs Sampled from a Normal Distribution';
      TITLE2 'Normal Quantile-Quantile Plot';
    RUN;

    /* Close the HTML destination so the files are written */
    ODS HTML CLOSE;

ODS OUTPUT (CONTINUED)

[Figure: HTML output of the ODS program.]
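The idea that ODS breaks procedure output into separate pieces can be illustrated with the ODS SELECT and ODS OUTPUT statements, which the slides describe but do not show. This is a minimal sketch under stated assumptions: it uses the built-in SASHELP.CLASS dataset as a stand-in for the Students data, and the dataset name WORK.HEIGHT_QUANTILES is hypothetical. Quantiles is the ODS table name that PROC UNIVARIATE gives to its quantile table.

```sas
/* Print only the Quantiles piece of the PROC UNIVARIATE report */
ods select Quantiles;

/* Also route that piece to the Output destination as a dataset */
/* (WORK.HEIGHT_QUANTILES is a hypothetical name).              */
ods output Quantiles=work.height_quantiles;

/* Stand-in data: SASHELP.CLASS ships with SAS */
proc univariate data=sashelp.class;
  var height;
run;

/* Inspect the captured table */
proc print data=work.height_quantiles noobs;
run;
```

Running the procedure with ODS TRACE ON beforehand lists the names of all the output pieces in the log, which is a convenient way to find what can be selected.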
CONCLUSION:

The Output Delivery System (ODS) is an extremely powerful tool that statisticians can use to communicate the results of analyses to clients. High-resolution graphics of the probability plots and Q-Q plots display theoretical distributions relative to the actual data. All of this is produced through a very simple procedure, SAS Proc Univariate.

REFERENCES:

Meeker, William Q. and Luis A. Escobar (1998). Statistical Methods for Reliability Data. New York: John Wiley & Sons.

SAS Institute, Inc. (2000). SAS OnlineDoc, Version 8. Cary, NC: SAS Institute, Inc.

SAS Institute, Inc. SAS Help and Documentation, Version 9. Cary, NC: SAS Institute, Inc.
TRADEMARKS:

SAS is a registered trademark or a trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

CONTACT INFORMATION:

Your comments and questions are valued and encouraged. Contact the author at:

    Rebecca G. Frederick
    Louisiana State University Agricultural Center
    Department of Experimental Statistics
    Baton Rouge, LA 70803-5606
    Work Phone: 225-578-8303
    Fax: 225-578-8344
    Email: rfrederi@lsu.edu
    http://www.stat.lsu.edu/faculty/frederick