A controlled experiment on the effects of PSP training: Detailed description and evaluation

Lutz Prechelt (prechelt@ira.uka.de)
Barbara Unger (unger@ira.uka.de)
Fakultät für Informatik, Universität Karlsruhe
D-76128 Karlsruhe, Germany
+49/721/608-4068, Fax: +49/721/608-7343
http://wwwipd.ira.uka.de/eir/

Technical Report 1/1999
April 8, 1999

Abstract

The Personal Software Process (PSP) is a methodology for systematic and continuous improvement of an individual software engineer's software production capabilities. The proponents of the PSP claim that the PSP methods improve in particular the program quality and the capability for accurate estimation of the development time, but do not impair productivity.

We have performed a controlled experiment for assessing these and related claims. The experiment compares the performance of a group of students that have just previously participated in a PSP course to a comparable set of students from a normal programming course. This report presents in detail the experiment design and setup, the results of the experiment, and our interpretation of the results.

The results indicate that the claims are basically correct, but the improvements may be a lot smaller than expected. However, we found an important additional benefit from PSP that is not usually mentioned by the PSP proponents: The performance in the PSP group was consistently less variable for most of the many variables we investigated. Less variable performance in a software team greatly reduces the risk in software projects.

Contents

1 Introduction
  1.1 What is the PSP?
  1.2 Experiment overview
  1.3 Related work
  1.4 Why such an experiment?
  1.5 How to use this report
2 Description of the experiment
  2.1 Experiment design
  2.2 Hypotheses
  2.3 Experiment format and conduct
  2.4 Experimental subjects
    2.4.1 Overview
    2.4.2 Education and experience
    2.4.3 The PSP course (experiment group)
    2.4.4 The alternative courses (control group)
  2.5 Task
    2.5.1 Goals for choosing the task
    2.5.2 Task description and consequences
    2.5.3 Task infrastructure provided to the subjects
    2.5.4 The acceptance test
    2.5.5 The gold program
  2.6 Internal validity
    2.6.1 Control
    2.6.2 Accuracy of data gathering and processing
  2.7 External validity
    2.7.1 Experience as a software engineer
    2.7.2 Experience with psp use
    2.7.3 Kinds of work conditions or tasks
3 Experiment results and discussion
  3.1 Statistical methods
    3.1.1 One-dimensional statistics
    3.1.2 Two-dimensional statistics
    3.1.3 Presentation of results
  3.2 Group formation
  3.3 Estimation
  3.4 Reliability and robustness
    3.4.1 Black box analysis and white box analysis
    3.4.2 The test inputs: a, m, and z

    3.4.3 Reliability measures
    3.4.4 Inputs with nonempty encodings
    3.4.5 Arbitrary inputs
    3.4.6 Influence of the programming language
    3.4.7 Summary
  3.5 Release maturity
  3.6 Documentation
  3.7 Trivial mistakes
  3.8 Productivity
  3.9 Quality judgement
  3.10 Efficiency
  3.11 Simplicity
  3.12 Analysis of correlations
    3.12.1 How time is spent
    3.12.2 Better documentation saves trivial mistakes
    3.12.3 The urge to finish
  3.13 Subjects' experiences
  3.14 Mean/median/iqr overview table
4 Conclusion
  4.1 Summary of results
  4.2 Possible reasons
  4.3 Consequences
Appendix
A Experiment materials
  A.1 Experiment procedure
  A.2 Questionnaire "personal information"
  A.3 Task description
  A.4 Questionnaire "Estimation"
  A.5 Questionnaire "Self-evaluation"
  A.6 Versuchsablauf
  A.7 Fragebogen "persönliche Angaben"
  A.8 Aufgabenstellung
  A.9 Fragebogen "Selbsteinschätzung"
  A.10 Fragebogen "Eigenbeurteilung"
B Glossary
Bibliography

"One item could not be deleted because it was missing."
Apple Macintosh System 7 OS error message

Technical Report 1/1999, Lutz Prechelt, Barbara Unger, University of Karlsruhe

Chapter 1 Introduction

"Everybody has opinions. I have data."
Watts S. Humphrey

The present report is the definitive and detailed description and evaluation of a controlled experiment comparing students who received PSP (Personal Software Process) training to other students who received other software engineering education.

In this first chapter we first discuss the general topic of the experiment, then give a broad overview of the purpose and setup of the experiment, and finally describe related work. Chapter 2 describes the subjects, setup, and execution of the experiment, relying partially on the original experiment materials as printed in the appendix. It also discusses possible threats to the internal and external validity of the experiment. Chapter 3 presents and interprets in detail the results obtained in the experiment, and Chapter 4 presents conclusions. The appendix contains the handouts used in the experiment: personal data questionnaire, estimation questionnaire, task description, work time logging sheet, and postmortem questionnaire.

1.1 What is the PSP?

The Personal Software Process (PSP) is a methodology for structuring the work of an individual software engineer, introduced by Watts Humphrey in 1995 [3]. At its core is the notion of an individual's software process, that is, the set of procedures used by a single software engineer to do his or her work. The PSP has several goals:

- Reliable planning capability, i.e., the ability to accurately predict the delivery time of a piece of work performed by a single software engineer.
- Effective quality management, i.e., the ability to avoid introducing defects or other quality-reducing properties into the work products, to detect and remove those that have been introduced anyway, and to improve both capabilities over time. These other quality attributes can be, for instance, the ease of maintenance, reuse, or testing (internal view of the product), or the suitability, flexibility, and ease of use (external view).
- Defining and documenting the software process, i.e., laying down in writing the abstract principles and concrete procedures by which one generally creates software. The purpose of process definition is to improve the ability of the process to be traced, understood, communicated, measured, or improved.

- Continuous process improvement, i.e., the capability to continuously identify the relatively weakest points in one's own software process, develop alternative solutions, evaluate these solutions, and subsequently incorporate the best one into the process.

The PSP may also lead to improved productivity, but this is not a primary goal; part of the productivity gains may be offset by the overhead introduced by the PSP, because the means to the above ends are process definition, process measurement, and data analysis, which lead to a number of additional tasks.

The PSP methodology is taught by means of a PSP course. In its standard form, this is a 15-week training program requiring roughly one full day per week. According to the experience of both Watts Humphrey and ourselves, the PSP can hardly be learned without that course: under the pressure of real-world working conditions, programmers will only be able to accept and execute the overhead tasks once they have experienced their benefits, but the benefits will only be experienced after the overhead tasks have been performed for quite a while. Hence, the course is needed for providing a pressure-free playground for learning about the usefulness of PSP techniques.

The practical consequence of the PSP course for an individual software engineer is to obtain a personal software process (psp, in non-capital letters). The course provides a set of example methods that serve as a starting point for the development of an individual psp. The methods are reasonable default instantiations of the PSP principles and can be tailored to one's individual preferences and work conditions during later psp usage.

1.2 Experiment overview

The question asked by this experiment is the following: What (if any) differences in behavior or capabilities can be found when comparing software engineers that have received PSP training to software engineers that have received an equivalent amount of conventional technical training?
The approach used to answer this question is the following:

- Find participants with similar capabilities and backgrounds, except that one group has had previous PSP training and the other has not.
- Let each participant solve the same non-trivial programming task.
- Observe as many features of behavior (process) and result (product) as possible. Examples: total working time, number of compilations, program reliability when the program was first considered functional (acceptance test), final program reliability, program efficiency, etc.
- Formulate hypotheses describing which differences might be expected between the groups with respect to the features that were observed. Example: PSP-trained participants produce more reliable programs.
- Analyze the data in order to test the hypotheses. Describe additional interesting structure found in the data, if any.
- Interpret the results.

In our case, the participants were graduate students and the programming task involved designing and implementing a rather uncommon search and encoding algorithm. On average, the task size was effectively more than 1 person day (between 3 and 50 work hours). A total of 55 persons participated in the experiment between August 1996 and October 1998.

1.3 Related work

It is one of the most important principles of the PSP methodology to base decisions on objective measurement data (as opposed to intuitive judgement). Consequently, the PSP course (and also later practicing of the PSP) builds a collection of data for each participant from which the development of several attributes of process quality and effective developer capability can be seen. Such data has been described and discussed by Humphrey in several articles and reports, e.g. [4].

Most of these data show the development of a certain metric over time, such as the decreasing density of defects inserted in a program. The main and unavoidable drawback of such data is a lack of control: It is impossible to say how much of the effect comes from each of the possible sources, such as the particular programming problem solved at each time, maturation that would also occur without PSP training (at least given some other training), details of the measurement that cannot be made objective, and, finally, real and unique PSP/psp benefits.

The purpose of the present experiment is to provide data with a much higher degree of comparability: measuring psp effects in a controlled fashion. We know of no other evaluation work specifically targeting the PSP methodology or the PSP course.

1.4 Why such an experiment?

There are basically two reasons why we need this experiment. First, any methodology, even one as convincing as the PSP, should undergo a sound scientific validation. The aim is not only to see whether it works, but rather to understand the structure of its effects: Which consequences are visible at all? How strong is their influence? How do they interact?

The second reason is more pessimistic: Based on our observations with about a hundred German informatics students, we estimate that only about one third of them will regularly use PSP techniques in their normal work after the course and will really form a psp. Roughly another third appears to be unable to regularly exercise the self-control required for building and using a psp.
For the rest, we consider the prospects to be unclear; their PSP future may depend on the kind of environment in which they will work. Given this estimate, it is not clear whether PSP-trained students will be superior to others, even if one is willing to believe that a PSP education in principle has this effect. The purpose of the experiment is to assess the average effect as well as to look for the results of the above-mentioned dichotomy, if any. If it exists, we may for instance see a larger variance of performance in the PSP group or maybe even a bimodal distribution having two peaks instead of just one.

1.5 How to use this report

This report is meant to provide a most detailed documentation of the experiment and its results. This has several consequences:

- You need not read the whole report from front to back. The contents are logically structured and it should be easy to find specific information of interest using the table of contents.
- When you encounter a term whose definition you have skipped, refer to the table of contents or the glossary for finding it.

- The main text does not try to describe the tasks or questionnaires in any detail but instead relies on the original experiment materials that are printed in the appendix. Please consult the appendix where necessary.
- The results section is rather detailed. We recommend sticking to the text and referring to the diagrams and their captions only at points of particular interest.

Chapter 2 Description of the experiment

"Good judgement comes from experience, and experience comes from bad judgement."
Anonymous

This chapter describes the experiment design, the hypotheses to be investigated by the experiment, the experiment procedure, the motivation and background of the participants, and the task to be solved. In the final sections we discuss possible threats to the internal and external validity of the experiment.

2.1 Experiment design

As mentioned before, the general goal of the experiment is to investigate differences in performance or behavior between persons that have received PSP training shortly before and other people. The resulting basic experimental design is very simple: it is a two-group, posttest-only, inter-subject design with a single binary independent variable, namely "subject has received PSP training" (see footnote 1). We will call the two groups P (for "PSP-trained") and N (for "not PSP-trained").

The set of dependent variables to be used is not at all simple, though. To avoid unnecessary repetition, these variables will be introduced step by step during the presentation of the results in Chapter 3.

In order to maximize the power of the experiment in view of the large variations in individual performance that are to be expected, one would ideally want to pair participants with similar expected performance (based on available knowledge about the participants' background) and put one person of each pair into each group. Unfortunately, this is not possible in the present setting, because group membership is determined by the previous university career of each participant, which cannot be controlled by the experimenter. See also the following section.

Footnote 1: We will use the terms "subject" and "participant" interchangeably to refer to the persons who participated in our experiment as experimental subjects.
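
For illustration only, the ideal pairing procedure described above (which could not be applied in this experiment) might be sketched as follows. The function name, the numeric "expected performance" score, and the random assignment within each pair are assumptions made for this sketch; the report does not define such a score or procedure.

```python
import random


def matched_pair_assignment(expected_scores, seed=0):
    """Split subjects into two groups by pairing neighbours in expected performance.

    expected_scores: dict mapping a subject id to a hypothetical numeric
    score of expected performance (an assumption of this sketch).
    Returns (group_a, group_b, leftover); leftover holds the one unpaired
    subject when the number of subjects is odd.
    """
    rng = random.Random(seed)
    # Rank subjects so that neighbours have similar expected performance.
    ranked = sorted(expected_scores, key=expected_scores.get)
    group_a, group_b = [], []
    # Take the ranking two subjects at a time and split each pair at random.
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)
        group_a.append(pair[0])
        group_b.append(pair[1])
    leftover = ranked[-1:] if len(ranked) % 2 else []
    return group_a, group_b, leftover


if __name__ == "__main__":
    # Purely hypothetical scores, not data from the experiment.
    scores = {"a": 1.0, "b": 2.0, "c": 10.0, "d": 11.0, "e": 50.0}
    print(matched_pair_assignment(scores))
```

Each pair contributes one member to each group, so differences in expected performance between the groups stay small by construction; this is the control that the present setting could not provide.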

2.2 Hypotheses

As mentioned in the overview in Section 1.2, the general purpose of the experiment is investigating behavior differences (and their consequences) between PSP-trained and non-PSP-trained subjects. To satisfy such a broad ambition, we must collect and analyze data as comprehensively as is feasible (from both a technical/organizational and a psychological point of view). However, to guide the evaluation of this data it is useful to formulate some explicit expectations about the differences that might occur. These expectations are formulated in this section in the form of hypotheses. The experiment shall investigate the following hypotheses:

Hypothesis H1: Estimation. Since effort estimation is a major component of the PSP course, we expect that the deviations of actual from expected development time are smaller in group P compared to group N. (Note that this hypothesis assumes that the task can be considered to be from a familiar domain, because otherwise the PSP planning may break down and the results become unpredictable.)

Hypothesis H2: Reliability. Since defect prevention and early defect removal are major goals throughout the PSP course, we expect the reliability of the program for normal inputs to be higher in group P compared to group N.

Hypothesis H3: Robustness. Since producing defect-free programs is an important goal during the PSP course, we also expect the reliability of the program for surprising sorts of inputs to be higher in group P compared to group N. (Note that the term robustness is somewhat misleading here because robustness against illegal inputs was explicitly not a requirement in the experiment.)

Hypothesis H4: Release maturity. We expect that group P will typically deliver programs in a relatively more mature state compared to group N. When the requirements are invariant, release maturity can be represented by the fraction of overall development time that comes after the first program release.
(In the experiment, releasing a program will be represented by the request of the participant that an acceptance test be performed with the program.)

Hypothesis H5: Documentation. Since the PSP course puts quite some weight onto making a proper design and design review and onto product quality in general, we expect that there will be more documentation in the final product in group P compared to group N.

Hypothesis H6: Trivial mistakes. Quality management as taught in the PSP course is based on the principle of taking even trivial defects seriously, hence we expect a lower number of simple-to-correct defects in group P compared to group N.

Hypothesis H7: Productivity. For programs that are difficult to get right, the PSP focus on early defect detection saves a lot of testing and debugging effort. The overhead implied by planning and defect logging usually does not outweigh these savings. What might outweigh the savings, though, is if participants produce a thorough design documentation that accompanies the program, e.g. in the form of long comments in the source code. Such behavior is also expected to be more likely for PSP-trained participants. Note that since all programs were built to conform to the same requirements, productivity can be defined simply as 1 divided by the total work time. Now for the actual hypothesis: We expect group P to complete the program faster compared to group N, at least if one subtracts the effort spent for documentation.

Hypothesis H8: Quality judgement. The PSP quality management tracks the density of defects in a program as seen during development and even as predicted before development starts. We speculate that this might also lead to more realistic estimates of the defect content and reliability of a final program. Hence, we expect to see more accurate estimates of final program reliability in group P compared to group N.
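
The quantities named in H1, H4, and H7 above can be written down as small functions. This is a minimal sketch: the function and parameter names are hypothetical, and the relative form of the estimation deviation is an assumption, since H1 only speaks of "deviations of actual from expected development time" without fixing a normalization at this point.

```python
def estimation_error(estimated_hours, actual_hours):
    """H1: relative deviation of actual from estimated development time.

    Assumption of this sketch: the deviation is normalized by the estimate.
    A value of 0.0 means a perfect estimate.
    """
    return abs(actual_hours - estimated_hours) / estimated_hours


def post_release_fraction(first_release_hours, total_hours):
    """H4: fraction of overall development time spent after the first release.

    "Release" means the first request for an acceptance test. A smaller
    fraction indicates a more mature program at release time.
    """
    return (total_hours - first_release_hours) / total_hours


def productivity(total_hours):
    """H7: productivity defined as 1 divided by the total work time."""
    return 1.0 / total_hours


if __name__ == "__main__":
    # Hypothetical numbers for a single subject, not data from the experiment.
    print(estimation_error(estimated_hours=4.0, actual_hours=6.0))
    print(post_release_fraction(first_release_hours=8.0, total_hours=10.0))
    print(productivity(total_hours=4.0))
```

Defining productivity as the reciprocal of work time is only meaningful because all programs had to meet the same requirements, as noted in H7.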

Speculative hypothesis H9: Efficiency. PSP quality management suggests preferring simpler solutions over "clever" ones. This might lead to programs that run more efficiently or use less memory, but might also lead to programs that run more slowly or require more memory. We hypothesize that we may find differences in speed and memory consumption between groups P and N, although we cannot say in advance which form these differences will have.

Speculative hypothesis H10: Simplicity. The more careful design phase presumably performed by PSP-trained subjects might lead to simpler and shorter code in group P compared to group N.

Note that this set of hypotheses is quite bold and the odds are not the same for each hypothesis: For estimation and reliability, for example, we clearly expect to see some advantage of the P group, while for efficiency expecting any differences is pure speculation. The results for documentation and productivity must be treated with care, because the experiment instructions first and foremost called for reliability, not for productivity or maintainability.

All of these hypotheses are expected to hold more strongly if we look at only the better half of the participants in each group, because then, according to the remark in Section 1.4, the PSP group presumably consists mostly of subjects that indeed use a psp. In Section 3.2, we will form various subgroups for such comparisons.

As mentioned above, the purpose of the experiment is not limited to formally testing these hypotheses, but includes all other appropriate analyses that might provide insights towards understanding the behavior differences and their effects. Consequently, we will use the hypotheses somewhat loosely in order not to be distracted from possibly more important other observations.

2.3 Experiment format and conduct

After some trial runs in August 1996, the experiment started in February 1997 and finished in October 1998. With a few exceptions, the participants worked during the semester breaks from mid-February to mid-April or mid-July to mid-October.
Subjects announced their participation by email and agreed on an appointment, usually starting at 9:30 in the morning. Each subject could choose the programming language to be C, C++, Java, Modula-2, or Sather-K; two subjects would have preferred Pascal. The compilers used were gcc, g++, JDK, mocka, and sak, respectively.

An account was set up for each subject on a Sun Unix workstation. At most three such accounts were in use at any given time. The account was set up so as to log activity, in particular each version of the source code that was submitted to the compiler. A subject was told about this monitoring on request. Due to a mistake made by one experimenter (Prechelt) when adapting the setup of the participant accounts to a change in the software environment of our workstations (switch to a new version of the Java development kit), the monitoring mechanism was corrupted and nonfunctional during a significant part of the experiment, and many of these compilation logs were lost. In order to provide as natural a working environment as possible and avoid irritating the participants, no direct observation, video recording, or other kind of visible monitoring was performed.

A subject was allowed to fetch auxiliary tools or data by FTP and install them for the experiment. For instance, a few subjects brought their own editor or re-used small parts of previously written programs (e.g. file handling routines). Some, but not all, PSP subjects brought their estimation data and/or PSP tools.

The subject was then given the first three parts of the experiment materials: first the personal information questionnaire, then the task description, and then the estimation questionnaire (see Appendix A). After filling in the latter, the subject was left alone and worked according to a schedule he could choose freely. Only three restrictions were made: first, to always work in this special account; second, to use only one source code file; and third, to log the working time on a special logging sheet we had provided along with the other materials. Each of these restrictions was violated by a few (but not many) of the participants.

The subject was told to ask if he encountered technical problems with the setup or if he felt something was ambiguous in the requirements. Technical problems occurred frequently and were then resolved on the spot. In order not to bias the results, the time required for resolving the problems (between 5 and 30 minutes per participant) is included in the work time. Inclusion is required because some subjects may have chosen to resolve the problem alone. Questions about the requirements were asked by about a dozen participants, and in all cases they were asked to read the requirements more closely, because the apparent ambiguity was indeed properly resolved in the description they had received. The participant was asked not to cooperate with other participants working at the same time or with earlier participants; we have not found any evidence of such cooperation and believe that essentially none has occurred.

The subject was further told to ask for an acceptance test at any time if he felt his program would now work correctly. If the acceptance test failed, the subject was encouraged to analyze the problems, correct the program, and try again. Once the acceptance test was passed (or the subject gave up), the participant was given the postmortem self-evaluation questionnaire. After finishing and returning the questionnaire, the subject was paid (see Section 2.4.1) and all data was copied from the subject's experiment account to a safe place.

In a few cases, the subjects asked for (and were granted) some additional time for improving the program after the acceptance test was passed. Typically this was used for cleaning up unused code sections and inserting comments into the program.

2.4 Experimental subjects

This section will describe the background of the experimental subjects.
Often in this report we refer to a few of the subjects individually by name: these names are letter/number-combinations in the range from s12 to s102.

2.4.1 Overview

The experiment had 50 participants, 30 of them in the PSP group and 20 in the control group. Our 50 participants break up into the following sorts with respect to their motivation:

- 40 of them were obliged to participate in the experiment as a part of a lab course they took (29 from two PSP courses and 11 from a Java course, see the description in Sections 2.4.3 and 2.4.4). Note that the obligation included only participation, not success: Even those participants that gave up during the experiment passed their course. One PSP participant, s045, retracted all of his materials from the experiment and will not be included in any of the analyses.
- The other 10 participants were volunteers:
  - 8 participants were volunteers from other lab courses in our department. Four of these (s020, s023, s025, and s034) were highly capable students who expressly came to prove that PSP-trained people are not better than others. Interestingly, two of these four, s020 and s034, later participated in the PSP course.
  - 2 participants (s081, s102) are actually "repeaters": they had already participated in the experiment in February/March 1997 as non-PSP subjects and voluntarily participated again in June 1998 and March 1998, respectively, after they had taken the PSP course.

Quite obviously, these differences need to be taken into account during data analysis; see Section 3.2 for a description of the actual groups compared in the analysis.

All participants were motivated towards high performance by the following reward system: They would be paid DM 50 (approximately 30 US dollars) for successfully participating in the experiment (i.e., passing the acceptance test). However, each failed acceptance test reduced the payment by DM 10.
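
The reward rule just described can be expressed as a small function. This is a minimal sketch with hypothetical names. Two points are assumptions not stated in the text: that a subject who never passes the acceptance test receives nothing, and that the payment cannot fall below zero.

```python
BASE_PAYMENT_DM = 50      # paid for passing the acceptance test
PENALTY_PER_FAIL_DM = 10  # deducted for each failed acceptance test


def payment_dm(failed_tests, passed):
    """Return the payment in DM for one participant.

    failed_tests: number of failed acceptance tests before passing or giving up
    passed: True if the acceptance test was eventually passed
    """
    if not passed:
        # Assumption: no payment without a passed acceptance test.
        return 0
    # Assumption: the payment is floored at zero.
    return max(0, BASE_PAYMENT_DM - PENALTY_PER_FAIL_DM * failed_tests)


if __name__ == "__main__":
    for fails in range(4):
        print(fails, payment_dm(fails, passed=True))
```

Under this rule every premature request for an acceptance test costs a fifth of the full reward, which gives participants an incentive to release only programs they believe to be correct.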

2.4.2 Education and experience

The following information excludes the two repeater participants, the one participant from the PSP group who retracted all of his materials from the experiment, and those participants for which the particular piece of information was not available ("missing").

The participants were in their 4th to 17th semester at the university (median 8, see Figure 2.1; please refer to Section 3.1.1 for an explanation of the plot). With the exception of two highly capable fourth-semester students (s023 and s050), all of the students were graduate students (after the "Vordiplom").

[Figure 2.1: Distribution of semester number of subjects in the PSP group (P), the non-PSP group (N), and both together (all). Axis: current semester number.]

The participants had a programming experience of 3 to 14 years (median 8 years, see Figure 2.2) with an estimated 10(sic!) to 15000(sic!) actual programming working hours beyond the programming exercises performed in the university education (median 600 hours, see Figure 2.3).

[Figure 2.2: Distribution of years of programming experience in the PSP group (P), the non-PSP group (N), and both together (all). Axis: total programming experience [years].]

[Figure 2.3: Distribution of hours of non-university programming experience in the PSP group (P), the non-PSP group (N), and both together (all). There is one point at 15000 in N. Axis: total programming experience [work hours].]

During that time, each of them had written an estimated total of 4 to 2000(sic!) KLOC (see footnote 2) with a median of 20, see Figure 2.4. The estimated total number of lines the subjects had written in the language they had chosen for the experiment was from 0.5 to 100 KLOC (median 5 KLOC, see Figure 2.6). The few extremely high values that occur in most of these variables show that there are a few quite extraordinary subjects in the sample, in particular in the non-PSP group.

Footnote 2: One KLOC is one thousand lines of code. For roughly half of the participants this includes comments, for the others it includes only statements.

[Figure 2.4: Distribution of total KLOC written in the PSP group (P), the non-PSP group (N), and both together (all). There is one point at 2000 in N. Axis: total programming experience [KLOC].]

[Figure 2.5: Distribution of size of largest program written in the PSP group (P), the non-PSP group (N), and both together (all). Axis: size of largest program written [KLOC].]

[Figure 2.6: Distribution of programming experience (in KLOC) in the programming language used during the experiment by each individual subject in the PSP group (P), the non-PSP group (N), and both together (all). Axis: programming language experience [KLOC].]

Looking at these data, our two main groups appear to be reasonably balanced. The apparently somewhat larger values in the N group for total experience in KLOC and size of largest program may be spurious, because the non-PSP participants are more likely to over-estimate their past productivity, as we will see in Section 3.3. There are two subjects (s034 and s043) that appear among the top three 5 times and 3 times, respectively, for the 5 programming experience measures; both are in the N group.

Note that the distribution of programming languages differs between the N and P group, see Figure 2.7. There is a relatively larger fraction of Java users in the N group.

[Figure 2.7: Number of participants in the N group (left) and the P group (right) that have used each programming language (Sather-K, Modula-2, Java, C++, C).]

2.4.3 The PSP course (experiment group)

The PSP methodology is taught by means of a PSP course. In its standard form and as taught in our department, this is a 15-week training program consisting of 15 lectures of ninety minutes each, 10 programming exercises, and 5 process exercises. This course requires roughly one full day per week. Each exercise is submitted to a teaching assistant who carefully checks the correctness of the materials (with respect to process) and marks any problems s/he finds. The materials are handed back to the students, who have to resubmit them in corrected form until everything is OK.

The focus of the course is on the following topics:

- working towards (and then according to) a well-defined, written-out software process,
- learning and using systematic planning and estimation based on personal historical data, and
- defect prevention and removal by means of managed personal reviews, defect logging, defect data analysis, and performance-data-controlled process changes.

The participants of the P group are from the second and third time we taught the PSP course. For the most part, we used the agenda defined by Humphrey in his book [3] on page 746. The participants needed to submit all exercises in good shape (with resubmissions if corrections were necessary) in order to pass the course. Those few participants who did not pass the course dropped out voluntarily during the semester; nobody was explicitly expelled.

2.4.4 The alternative courses (control group)

The volunteers of the N group (s014, s017, s020, s023, s025, s028, s031, s034) came from various other lab courses.
The non-volunteers of the N group all came from an advanced Java course ("component software in Java"); many of them had previously also participated in a basic Java course we had taught the year before. The course followed a compressed schedule: it ran over only 6 weeks of the 13-week semester, but required a very high time investment during that time. In terms of the total amount of code produced, this course is quite similar to the PSP course, although it had only 5 larger exercises instead of 10 smaller ones. The course content was highly technical, covering the then-new Swing GUI classes, localization and internationalization, serialization and persistence, reflection and JavaBeans, and distributed programming (Remote Method Invocation). The programs were submitted to the course teachers and tested in a black-box fashion. Participants needed to score 70% of all possible points (based on correctly implemented functionality) in order to pass the course.

2.5 Task

This section briefly describes the task to be solved in the experiment and explains why we chose it. For details, please refer to the original task description on page 66 in the appendix.

2.5.1 Goals for choosing the task

The tasks to be used in our experiment should have the following properties:

1. Suitable size. Obviously, the task should not be too small, in order to provide enough and interesting data. A trivial task would have too little generalizability. The task was planned to take about 4 or 5 hours for a good programmer, so that most participants would be able to finish within one day if they started in the morning. (It later turned out that only 28% of the participants were able to finish on the day they started and 46% took more than two days.)

2. Suitable difficulty. Most if not all of the participants should be able to complete the task successfully. In particular, it must be possible to solve the task without inventing an algorithm or data structure that requires high creativity. Furthermore, the application domain of the task had to be well understandable by all subjects. On the other hand, it must be possible to make subtle mistakes or ruin the efficiency so that there can be sufficient differences in the work products among even the successful solutions.

3. Automatic testability. In order to test the quality of the solutions thoroughly and objectively it must be possible to run a rather large number of tests without human intervention. In particular, the acceptance test should be automatic and entirely objective.
2.5.2 Task description and consequences

From these requirements, we chose the following task: Given a list of long telephone numbers and a dictionary (list of words), encode each of the telephone numbers by one word or a sequence of multiple words in every possible way according to a fixed, prescribed letter-to-digit mapping. A single digit may stand for itself in the encoding between two words under certain circumstances. Read the phone numbers and the dictionary from two files and print each resulting encoding to standard output in an exactly prescribed format. Please see the exact task description on page 66 for the details and for input/output examples.

The above functionality can be coded in about 150 statements with any programming language that has a reasonable string handling capability. Understanding the requirements exactly and producing an appropriate search algorithm is not trivial, but certainly within the capabilities of the participants. Various details give enough room for gross or subtle mistakes, e.g. handling special characters allowed in the phone numbers (slash, dash) or the words (quote, dash), always producing the correct output format, or handling all cases of digit-insertion correctly. The algorithmic nature of the problem is simple to understand for all subjects, regardless of specific backgrounds, and the search algorithm gives room for enormous differences in the resource consumption (both space and time) of the resulting program. The batch job character of the program makes automatic testing possible and the simple structure of the input data even allows for fully automatic generation of test cases once a correct gold program has been implemented. This allowed for generating new data for each acceptance test on the fly. During the evaluation of the experiment it turned out that the differences among the solutions were even larger than expected.

2.5.3 Task infrastructure provided to the subjects

Along with the task description, the following task-specific infrastructure was provided to the participants: the miniature dictionary test.w and the small phone number input file test.t used in the example presented in the task description; a file test.out containing the correct output for these inputs; and a large dictionary called woerter2 containing 73220 words. The same large dictionary was also used during the evaluation of all programs presented in this report, a fact which we told the participants upon request.

2.5.4 The acceptance test

The acceptance test worked as follows: For each test, a new set of 500 phone numbers was created and the corresponding correct output computed using the gold program. This took only a few seconds. The dictionary used was a random but fixed subset of 20946 words from the woerter2 dictionary. Then the candidate program was run with these inputs and the outputs were collected by an evaluation Perl script. This script matches the outputs of the candidate program to the correct outputs and computes the reliability of the candidate program. The evaluation script stops the candidate program if it is too slow: an accumulative timeout of 30 seconds per output was applied plus a 5 minute bonus for loading the dictionary at the start.
This means that, for instance, the 40th output must be produced before 25 minutes of wall clock time are over (the 5 minute bonus plus 40 times 30 seconds) or else the program will be stopped and its reliability judged based on the outputs produced so far. Many programs did indeed run for half an hour or more during the acceptance test; the number of expected outputs varied from 25 to 248 with a typical range of 40 to 80.

At the end of the acceptance test the following data was printed by the evaluation script: the sorted actual output of the candidate program, the sorted expected output (i.e., the output of the gold program), a list of differences in the form of missing correct outputs and additional incorrect outputs, and the resulting reliability in percent. The exact notion of reliability will be defined in Section 3.4.3 under the name of output reliability. A minimum output reliability of 95 percent was required for passing the acceptance test. However, in their final acceptance test, with only two exceptions all programs either achieved 100 percent or failed entirely.
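The accumulative timeout rule described above can be sketched in a few lines. This is an illustrative Python sketch, not the original Perl evaluation script; the function names and the choice to treat an output arriving exactly at its deadline as still in time are assumptions made for illustration.

```python
PER_OUTPUT_SECONDS = 30          # accumulative timeout per expected output
STARTUP_BONUS_SECONDS = 5 * 60   # bonus for loading the dictionary


def deadline_seconds(output_index):
    """Wall-clock deadline (seconds since start) for the given output (1-based)."""
    return STARTUP_BONUS_SECONDS + PER_OUTPUT_SECONDS * output_index


def outputs_before_stop(timestamps):
    """Count the outputs produced before the program would be stopped.

    timestamps: wall-clock times (seconds since start) at which the
    candidate program produced its 1st, 2nd, ... output.
    """
    count = 0
    for index, t in enumerate(timestamps, start=1):
        if t > deadline_seconds(index):
            break  # too slow: the program is stopped at this point
        count += 1
    return count


print(deadline_seconds(40) / 60)  # 25.0 minutes, matching the example in the text
```

Because the timeout accumulates, a program that is slow to load the dictionary but fast afterwards can still catch up; only the cumulative time per output counts.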

2.5.5 The gold program

Given this style of acceptance test, the before-mentioned gold program obviously plays a rather important role in this experiment. The gold program was developed by Lutz Prechelt together with the development of the requirements. The initial requirements turned out to be too simple, so the rules for allowing or forbidding digits in the encoding were made up and added to the requirements during the implementation of the gold program.

The gold program, called phoneword, was developed using a PSP in July 1996 during three sessions of about six hours total. The total time splits into 126 minutes of design (including global design, pseudocode development, and test development), 93 minutes of design review, 72 minutes of coding, 38 minutes of code review, and 19 minutes of compilation. The program ran correctly upon the first attempt and no defect was ever found after completion, despite numerous claims of participants that "the acceptance test program is wrong. My program works correctly!". 19 defects were originally introduced in the design and pseudocode, 11 of which were found during design review; 6 defects were introduced during coding and found during code review or compilation. These values include trivial mistakes such as syntactical errors.

The program was written in C with refinements (see footnote 3), which turned out to be a superbly suitable basis for this problem. The initial program was only modestly efficient (trying 10% of the dictionary for each digit). A few days later it was improved into a more efficient version (called phoneword2, trying only 0.01% of the dictionary for each digit) which also worked correctly right from the start. This second version was used throughout the experiment.

2.6 Internal validity

There are two sources of threats to the internal validity of an experiment (see footnote 4): insufficient control of relevant variables or inaccurate data gathering or processing.

2.6.1 Control

Controlling the independent variable means holding all other influential variables constant and varying only the one under scrutiny.
Controlling the dozens of possibly relevant variables in human-related experiments is usually done by random sampling of participants into the experiment groups and subsequent averaging over these groups: variation in any other than the controlled variable is then expected to cancel out. However, random sampling is difficult for software engineering experiments, because they require such a high level of knowledge. The problem becomes particularly pronounced if the independent variable is a specific difference in education: neither can we randomly sample from a large group of possible participants, nor can we freely assign each of them into a group chosen at random. Instead, we are confined to a small number of available subjects with the proper background and, worse yet, typically each of them fits into only one of the groups, because we cannot impose a certain education on the subjects and withhold the other; they choose themselves what they want to learn.

In the given experiment this means that the preferences that let the subjects choose one course and not the other could in principle be related to the results observed. We cannot prove that there is no such effect, but based on our personal knowledge of the individuals in both courses, we submit that we cannot see any severe difference in their average capabilities. In fact, because they liked us as teachers, several participants from the PSP course later also took the other course and vice versa.

Footnote 3: http://wwwipd.ira.uka.de/~prechelt/sw/#crefine

Footnote 4: Definition from [1]: "Internal validity refers to the extent to which we can accurately state that the independent variable produced the observed effect."

2.6.2 Accuracy of data gathering and processing

Inaccurate data gathering or processing is unlikely as there was very little manual work involved in this respect. Instead, most data gathering and almost all data processing was automated. The scripts and programs were carefully developed and tested and their results again scrutinized. Many consistency checks were applied for detecting possible mistakes. One remaining problem is missing data, which is almost inevitable in any large scale experiment. The consequences of missing data, if any, will be discussed for each dependent variable in the results sections below.

2.7 External validity

Three major factors limit the generalizability (external validity) of this experiment: longer experience as a software engineer, longer experience with PSP use, and other kinds of work conditions or tasks. (You may perhaps want to skip the rest of this section until you have seen the results and their discussion.)

2.7.1 Experience as a software engineer

One of the most frequent critiques applied to controlled experiments performed with student subjects is that the results cannot be transferred to software professionals. The validity of this critique depends on the actual task performed in the experiment: if the task requires very specific knowledge such as the capability to properly use complex tools or esoteric notations or uncommon processes, then the critique is probably valid. If, on the other hand, only very general software production abilities are required in the task, a graduate student group performs not much differently from a group of professionals: It is known that experience is a rather weak predictor of performance within a group of professionals, and in fact these same students will soon be professionals themselves.

The task in this experiment is very general, requiring only knowledge that is taught during undergraduate university education. Hence, we can expect to find relatively little difference in the behavior of our student subjects compared to professionals.
We can imagine only two differences that might be relevant. First, professionals taking a PSP course after some professional experience may often be much more motivated towards actually using PSP techniques, because due to previous negative experiences they have a much clearer conception of the possible benefits than students. Student background often involves only relatively small projects with comparatively little schedule pressure and little need for relying on the work of colleagues. This motivation difference, if present, should make the differences between PSP and non-PSP groups more pronounced for professionals. Second, some of the less gifted students may later pick a non-programming job, resulting in a sort of clean-up (smaller variance of performance in the lower part) in a group of professionals compared to a group of students. The possible consequences of this effect, if it exists, on the difference between PSP and non-PSP groups are unclear.

2.7.2 Experience with PSP use

The present experiment investigates performance and behavior differences shortly after a PSP course. One should be rather careful when generalizing these results to persons that have been using a PSP for some longer time, say, two years.

It is plausible that in those cases where differences between the PSP group and the non-PSP group were found, these differences will become more pronounced over time. However, for a PSP-adverse work environment it is also conceivable that differences wear off over time (because PSP techniques are no longer used) and it is unclear whether differences may emerge over time where none have been observed in the experiment. It would definitely be important to run a similar experiment much longer after a PSP (or other) training.

2.7.3 Kinds of work conditions or tasks

As mentioned above, it is plausible that differences due to PSP training may be reduced by a working environment that discourages the sort of data gathering implied by PSP techniques; the level of actual PSP use may just drop. Conversely, the effects might also become more pronounced, for instance if the tasks are very difficult to get right, if the work conditions demand accurate communication of technical decisions, or if accurate planning can reduce the stress due to schedule pressure.

Furthermore, some of the PSP participants may have taken the experiment task too lightly and may have underused their PSP in comparison to their standard professional working behavior. For instance, some of them did not bring their PSP tools or PSP estimation data.

All of this is unknown, however, so adequate care must be exercised when applying the results of this experiment to such different situations.

Chapter 3 Experiment results and discussion

"I don't know the key to success, but the key to failure is to please everybody." (Bill Cosby)

This chapter presents and interprets the results of the experiment. The first section explains the means of statistical analysis and result presentation that we use and explains why they were chosen. Section 3.2 describes how, exactly, the groups to be compared in the analysis were derived from the raw groups of PSP and non-PSP subjects. The following sections present the results (objective performance); the analysis is organized along the hypotheses listed in Section 2.2. (Warning: The amount of detail in the diagrams and captions may overwhelm you. The main text, however, is short and easy to read.) Two final sections describe findings from an analysis of correlations between variables and findings from analyzing the answers of the subjects given in the postmortem questionnaire.

3.1 Statistical methods

In this section we will describe the individual statistical techniques (including the graphical presentations) used in this work for assessing the results. For each technique, we will describe its purpose, the meaning of its results, and its caveats and limitations. The analyses and plots were made with S-Plus 3.4 on Solaris and we will briefly indicate the names of the relevant S-Plus functions in the description as well.

3.1.1 One-dimensional statistics

For most of this report, we will simply compare the values of a single measurement for all of the participants in the PSP group against the participants in the non-PSP group. The simplest form of such a comparison is comparing the arithmetic mean of the values in one group against the mean of the values in the other. However, the mean can be very misleading if the data contains a few values that are very different from the rest. A more robust basis for a comparison is hence the median, that is, the value chosen such that half of the values in the group are smaller or equal and the other half are greater or equal.
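The difference between the two measures can be illustrated with a tiny example. The following Python sketch uses made-up values (they are not data from the experiment) and the standard library instead of S-Plus; it shows how a single extreme value moves the mean far more than the median.

```python
from statistics import mean, median

# Hypothetical experience values in KLOC, made up for illustration only.
typical = [4, 5, 5, 6, 7]
with_extreme = typical + [100]  # one extraordinary subject added

print(mean(typical), median(typical))            # 5.4 5
print(mean(with_extreme), median(with_extreme))  # mean is about 21.17, median is 5.5
```

Adding one extreme subject roughly quadruples the mean, while the median moves only from 5 to 5.5.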
In contrast to the mean, the median is not influenced by how far away from the rest the most extreme values are located.

We may be interested not only in the average behavior of the groups, but also in the variation (variability, variance) within each group. In our context, smaller variation is usually preferable, because it means more predictable software development. One way of assessing variation is the standard deviation. If the underlying data follows a normal distribution (the Gaussian bell curve), about two thirds of the data (68%) will lie within plus or minus one standard deviation from the mean. However, if the data does not follow a normal distribution, the standard deviation is plagued by the same problem as the mean: a few far-away values will influence the result heavily, and the resulting standard deviations can be very misleading. Software engineering data often has such values and hence the standard deviation is not a reliable measure of variation. Instead, we will often use the interquartile range and similar measures, which we will explain now in a graphical context.

A flexible and robust way of comparing two groups of values for both average (statisticians speak of "location") and variation (called "spread") is the boxplot, more fully called box-and-whisker plot (S-Plus: bwplot()). You can find an example in Figure 3.4 on page 28. The data for the PSP group is shown in the upper part, the data for the non-PSP group in the lower part. The small circles indicate the individual values, one per participant. Only their horizontal location is important; the vertical jittering was added artificially to allow for discriminating values that happen to be at the same horizontal position. The width and location of the rectangle (the "box") and the T-shaped lines on its left and right (the "whiskers") are determined from the data values as follows. The left edge of the box is located such that 25% of the data values are less than or equal to its position; the right edge is chosen such that 75% of the data values are less than or equal to its position (which means that 25% are greater than or equal to that value). The position of the left edge is called the 25-percentile or 25% quantile or first quartile; the right edge is correspondingly called the 75% quantile or third quartile. Similarly, the left and right whiskers indicate values such that exactly 10% of the values are smaller or equal (left whisker, 10-percentile) or 10% are larger or equal (right whisker, 90-percentile), respectively.
The fat dot within the box marks the median, which could also be called 50-percentile, 50% quantile, or second quartile. Note that different percentiles can be the same if there are several identical data values (called ties), so that, for instance, whiskers can be missing because they are identical with the edge of the box, or the median dot can lie on an edge of the box, etc.

Boxplots allow for easy comparison of both spread and location of several groups of data. One can concentrate either on the width of the boxes (called the inter-quartile range or iqr) or the width of the whole boxplots for comparing spread, or can concentrate on particular box edges or the median dots for comparing different aspects of location (namely the location of the lower half, upper half, or middle half of the data points).

Note that for distributions that have only few distinct values (typically all small integers) and therefore contain many ties, differences in the width of the box or the location of any of the quartiles can be misleading because they may change a lot if only a single data value changes. Figure 3.26 on page 41 shows a simple example. The two distributions are similar, but the boxplots look quite different. A similar caveat applies when the number of data values plotted is small, e.g. less than ten.

Our boxplots have one additional feature: the letter M in the plot indicates the location of the mean and the dashed line around it indicates a range of plus or minus one standard error of the mean. The latter quantifies the uncertainty with which the mean is estimated from the data and decreases with decreasing standard deviation of the data and with an increasing number of data points. For about 68% of all data samples of the given size taken from the same population, the sample mean will lie within this standard error band. For symmetric distributions, the mean is equal to the median. However, in our data, many distributions are skewed, i.e., the data is less dense on one end of the distribution than on the other. In this case, the mean will lie closer to the less dense end than the median.
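The quantities that make up such a boxplot can be computed directly. The sketch below is a Python illustration, not the S-Plus code used for this report; the function name boxplot_summary is hypothetical, and the quantile interpolation rule (Python's "inclusive" method) is an assumption that may differ slightly from the one S-Plus applies.

```python
import math
import statistics


def boxplot_summary(values):
    """Return whisker, box, median, mean, and standard-error values."""
    # With n=20 the cut points are the 5%, 10%, ..., 95% quantiles.
    cuts = statistics.quantiles(values, n=20, method="inclusive")
    p10, q1, med, q3, p90 = cuts[1], cuts[4], cuts[9], cuts[14], cuts[17]
    mean = statistics.mean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return {
        "p10": p10,       # left whisker
        "q1": q1,         # left edge of the box
        "median": med,    # fat dot
        "q3": q3,         # right edge of the box
        "p90": p90,       # right whisker
        "iqr": q3 - q1,   # width of the box
        "mean": mean,     # letter M
        "sem": sem,       # half-width of the dashed line around M
    }


summary = boxplot_summary(list(range(21)))  # the values 0, 1, ..., 20
print(summary["q1"], summary["median"], summary["q3"], summary["iqr"])
```

For the evenly spaced values 0 to 20 the box runs from 5 to 15, the whiskers sit at 2 and 18, and mean and median coincide at 10, as expected for a symmetric distribution.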
A string such as "2 missing" (e.g. on the left edge of Figure 3.5) indicates that two of the data points in the sample had missing values and hence are not shown in the plot.

When comparing two distributions, say in a boxplot, it is often unclear whether observed differences in location should be considered accidental or real. This question can be assessed by a statistical hypothesis test. A test computes the probability that the observed differences of, say, the mean will occur when the underlying distributions in fact have the same mean.
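One generic way to obtain such a probability is a permutation test. The following Python sketch only illustrates the idea; it is not necessarily one of the tests used in this report, and the function name, the number of rounds, and the fixed random seed are assumptions made for the example.

```python
import random
from statistics import mean


def permutation_test_mean(a, b, rounds=2000, seed=0):
    """Estimate how often a random regrouping of the pooled data yields a
    difference of means at least as large as the one observed."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)  # pretend group membership is arbitrary
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / rounds


# Made-up values for two clearly separated groups (not experiment data).
group_p = [1, 2, 3, 4, 5, 6]
group_n = [101, 102, 103, 104, 105, 106]
print(permutation_test_mean(group_p, group_n))
```

A small result means that a difference as large as the observed one rarely arises by chance regrouping alone, so the difference is unlikely to be accidental; a result near 1 means the observed difference is entirely compatible with chance.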