Experiment about Test-first Programming

Matthias M. Müller and Oliver Hagner
Computer Science Department, University of Karlsruhe
Am Fasanengarten 5, 76128 Karlsruhe, Germany
{muellerm|hagner}@ipd.uka.de

Abstract

Test-first programming is one of the central techniques of Extreme Programming. Programming test-first means (1) write down a test-case before coding and (2) make all the tests executable for regression testing. Thus far, knowledge about test-first programming is limited to experience reports. Nothing is known about the benefits of test-first compared to traditional programming (design, implementation, test). This paper reports on an experiment comparing test-first to traditional programming. It turns out that test-first does not accelerate the implementation and the resulting programs are not more reliable, but test-first seems to support better program understanding.

1 Introduction

Test-first programming is one of the central techniques of Extreme Programming (XP). Test-first combines two general principles: first, write down the test-cases before coding, and second, make them executable for regression testing. The whole process, including a small design, nano-increments, and ongoing testing, is also called test-driven development. Jeffries describes test-first programming in four steps [1]:

1. Find out what has to be done.
2. Write a unit test for the desired new functionality. Pick the smallest increment from the new functionality.
3. Run the unit test. The new functionality is implemented if the test succeeds. If there are still more unimplemented functionalities, go to step 1. If the test fails, go to step 4. Otherwise, all is done.
4. Fix the immediate problem: maybe it's the fact that the new method wasn't written yet. Maybe the method doesn't quite work. Fix whatever it is. Go to step 3.
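One pass through these four steps might look as follows in plain Java. This is a minimal sketch with a hand-rolled check instead of junit, and the Counter example and all names are illustrative, not taken from the experiment:

```java
// Step 2: the unit test for the smallest increment is written first.
// Steps 3/4: running it against an empty increment() fails until the
// method body below is filled in. (Illustrative example only.)
class Counter {
    private int value = 0;

    // The increment under test: written only after the test below existed.
    void increment() { value++; }

    int value() { return value; }
}

class CounterTest {
    // A hand-rolled stand-in for a junit test method.
    static boolean testIncrementRaisesValueByOne() {
        Counter c = new Counter();
        c.increment();
        return c.value() == 1;   // the "assertion"
    }

    public static void main(String[] args) {
        System.out.println(testIncrementRaisesValueByOne() ? "PASS" : "FAIL");
    }
}
```

Because the test exists first, step 3's red/green outcome decides whether the programmer picks the next increment or returns to step 4.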
The tasks of test-first are manifold within the framework of XP rules and practices: to ensure that programmed features cannot be lost, to force the developers to think about testable code and to write down unit tests, to automate test execution, to prevent the program from regressing by ongoing retesting, and to provide a test-suite as a basis for refactoring. The list of goals to attain with test-first is no less important: to develop programs that are more capable of accepting changes, to program faster, to increase the confidence of the developer

and the customer, by seeing all the tests run correctly, to reduce defect rates in such a way that the successive test-cycle overhead becomes negligible, and to understand the program better.

While experience shows that it is difficult for the beginner to adapt to test-first [6, 10], the above formulated goals of test-first are still left for evaluation. One of the challenges of studying test-first is its embedding within XP. This embedding makes it difficult to show the effects of test-first without being blurred by other practices such as pair programming or a simple design. A solution to this problem would be an experiment in which XP is applied twice: with test-first and without test-first. The result would be an indirect evaluation of test-first. But this kind of experiment is too difficult and too expensive. To solve this problem, test-first was extracted from XP and evaluated on its own. Now, the experiment focuses on a single programmer with his traditional development process (design, implementation, test). The authors wanted to know if there are any advantages or disadvantages for a single programmer when switching from the traditional development process to test-first.

Concerning test-first, this paper focuses on (1) the programming efficiency (how fast someone obtains a solution), (2) the reliability of the resultant code (how many failures can be observed), and (3) program understanding (measured as proper calls of existing methods). The experiment focused neither on design aspects of the delivered solution (e.g. how changeable is the design) nor on the benefits of test-first in the long run (e.g. shorter test-cycles, shorter time to market, or defect rates in the production code).

The experiment was conducted as part of an XP course held with CS graduate students. The participants were divided into two groups: the experiment group, which used test-first, and the control group, which followed the traditional development process. Both groups had to implement the main class of a graph library containing only the method declarations but not the method bodies.
The subjects' work was divided into two phases. During the first phase, the implementation phase, the subjects implemented the solution up to the point where they thought that their implementation was correct. The second phase, the acceptance-test phase, involved passing an acceptance-test. If the acceptance-test failed, the subject got the output of the test and had to fix the faults. This was iterated until the acceptance-test succeeded. Only then was the solution accepted. The goal of the acceptance-test was to ensure a minimum of code quality in the final solutions.

The results show no difference between the two groups concerning the overall problem solving time and the final reliability of the produced results. But the test-first group made fewer errors when reusing an existing method more than once. The last observation is statistically significant with p = 0.09. We also compared both groups after the first phase, that is, before the acceptance-test took place. Again, we measured no difference in problem solving time, but the programs of the test-first group were less reliable (significance p = 0.03). Looking at these results, we drew the following conclusions. Writing programs test-first neither leads to a solution earlier nor provides more reliable results. On the other hand, using test-first increases program understanding, measured as a proper reuse of existing interfaces.

An open question remains from our study. Why are the programs of the test-first group less reliable than those of the control group at the end of the first phase? Possible explanations are the following. (1) Were the subjects insufficiently experienced in the use of the test-first approach? That is, was their experience with test-first too limited to see that a bit more testing was needed? (2) Did they lull themselves into a false sense of security? Did the ongoing execution of the tests suggest a code quality that did not exist? Or (3) did they not have

any respect for the acceptance-test as a result of the ongoing testing? This could be possible because they knew an acceptance-test was to come and took it into account as an additional quality measure at the end of the development process.

So far, knowledge about XP is limited to experience reports. Only pair programming has been investigated to a certain extent [2, 11, 4, 12]. This paper starts an evaluation of test-first. In fact, it isolates test-first from the other techniques of XP, but later, when we have an understanding of all techniques of XP, we can combine them and study their combined behavior.

The remainder of this paper presents the experimental settings in Section 2 and the measured results and their discussion in Section 3. A summary of the paper is given in Section 4.

2 Description of the experiment

2.1 Design of the experiment

The experiment uses a single-factor, posttest-only, inter-subject design [3]. The controlled independent variable was whether the experimental subjects program test-first (the experiment or test group, subsequently called the test-first group) or use the traditional development process (subsequently called the control group). Each subject of either group solved the same task and worked under the same conditions. The observed dependent variables for each subject were a variety of measurements of the development process (in particular total time), and various measurements of the delivered product (in particular program reliability and number of reused methods).

2.2 Subjects

Overall, 19 persons participated in the experiment, 10 in the test-first group and 9 in the control group. All of them were male Computer Science graduate students who had just previously participated in a one-semester graduate lab course introducing the XP methodology, along with a larger programming assignment. This course covered techniques from XP such as pair programming, test-first, refactoring, and planning.
On average, these 19 students were in their 6th semester at the university; they had a median programming experience of 8 years total and estimated that they needed a median of 3 months to program their largest program of about 5000 LOC. None of these measures was significantly different between the two groups. During the experiment all of the participants used Java with junit [9], as they did in the XP course. None of the participants dropped out of the experiment, so that all work was available for the evaluation.

2.3 Experiment task

The task to be solved in this experiment is called "GraphBase". It consists of implementing the main class of a given graph library [7] containing only the method declarations and method comments but not the method bodies. There are methods to add vertices and edges and to clear and to clone a whole graph. Other methods are only accessor methods, e.g. to show the number of vertices or edges, to find an edge between two given vertices, or to test if the graph is empty, weighted, or directed.
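The shape of a library like this can be sketched as follows. This is a minimal working sketch with invented names, not the actual GraphBase interface from [7]; it only illustrates the kinds of methods the task description lists (add, clear, accessors):

```java
import java.util.*;

// Illustrative sketch of a directed-graph class with the kinds of
// methods the task description mentions. Names are assumptions,
// not taken from the graph library in [7].
class SimpleGraph {
    private final Set<String> vertices = new HashSet<>();
    private final Map<String, Set<String>> edges = new HashMap<>();

    void addVertex(String v) {
        vertices.add(v);
        edges.putIfAbsent(v, new HashSet<>());
    }

    void addEdge(String from, String to) {
        addVertex(from);
        addVertex(to);
        edges.get(from).add(to);
    }

    void clear() {
        vertices.clear();
        edges.clear();
    }

    // Accessor methods of the kind the task lists.
    int numberOfVertices() { return vertices.size(); }

    boolean isEmpty() { return vertices.isEmpty(); }

    boolean findEdge(String from, String to) {
        return edges.containsKey(from) && edges.get(from).contains(to);
    }
}
```

The subjects received only declarations and comments of such methods; filling in the bodies while respecting the rest of the library was the task.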

Each subject was told that the original code of GraphBase was lost and, because there is no backup, that it should be reimplemented by using the rest of the given graph library. The requirements for this task were described thoroughly in natural language. The subjects were expected to work and to test on their own until they thought they had finished. They were also told that they had to pass an acceptance-test to ensure some quality of their solutions.

2.4 Experimental procedure

The experiment was run between July 2001 and August 2001, mostly during the semester breaks. Most of the subjects started around 9:30 in the morning. The experiment materials were printed on paper and consisted of two parts. Part one was issued at the start of the experiment and contained a task description. The second part was handed out at the end of the experiment. It contained questions about the understandability of the documentation and asked for personal ratings concerning program understanding and reliability of the resultant program.

The subjects worked on the task using their specific Unix account from the XP course. The account was changed for the experiment to provide the automatic monitoring infrastructure. It non-intrusively recorded login/logout times, all compiled source versions, and all output from each program run. The recorded source code versions included the GraphBase class, all written test classes, and all other Java classes the subjects wrote in order to solve the experiment task. The subjects could modify the account setup as necessary. The source code of the graph library, except for the GraphBase method bodies, was provided to the subjects.

The subjects' work was divided into two phases. Implementation phase (IP), during which the subjects solved their assignment until they thought that their program would run correctly. This phase finished with their call for the acceptance-test. Acceptance-test phase (AP), during which the subjects had to fix the faults that caused the acceptance-test to fail. The acceptance-test itself is programmed using junit.
It consists of 20 test cases with 522 assertions which build some graphs and check their structure for correctness. If an assertion fails, it generates an output with the expected and the actual value, aborts the actual test, and continues with the next one.

2.5 Power analysis

Cohen [5] stresses the importance of power analysis to get a closer look at the quality of a statistical hypothesis test. The power of a statistical test of a null hypothesis is the probability that it will yield statistically significant results. It is defined as the probability that it will lead to the rejection of the null hypothesis, i.e., the probability that it will result in the conclusion that the phenomenon exists under the premise that the phenomenon really exists. Statistically speaking, 1 - power is the probability of an error of the second kind. In our experiment, we used groups with n = 9 and n = 10 subjects. Due to this small number of data points, we restrict our analysis to finding only large effects. In this case, Cohen suggests an effect size of ES = 0.8. We set the significance level of the one-sided test to

α = 0.1. Thus, the power analysis with a t-distribution yields a power of 0.645 [8]. That is, we have only a 64.5% chance of finding a difference between the groups! According to Cohen, the experiment has poor power. He argues that only experiments with a power of more than 0.8 have a real chance of revealing any effect. Therefore, it is quite reasonable that, due to this poor power, our experiment does not have the chance to show an effect, even if there is one. But, as we could not acquire any more subjects for the experiment, we had to live with this drawback.

2.6 Threats to internal validity

The control of the independent variable is threatened by the fact that it was not technically controlled. The subjects in the test-first group were told in the requirements to use test-first programming as they did during the whole XP course. During the experiment, the subjects were asked by the experimenter several times if they got along with the test-first process. This question was affirmed by all subjects of the test-first group.

2.7 Threats to external validity

There are two important problems for the external validity (generalizability) of the experiment. First, professional software engineers may have different levels of skill and experience than the participants, which might make the results too optimistic or too pessimistic: both higher and lower levels will occur, because the students are more skilled than most of the non-computer-scientists that frequently start working as programmers. Better skilled subjects might leave less room for improvement, which might reduce the difference between the groups, but higher experience may also sharpen the eye as to where improvements are most desirable or most easily achieved. Conversely, lower skill may leave more room for improvement but may also impede applying test-first correctly at all. Second, the XP education of the subjects occurred only a short time ago. It is conceivable that the test-first usage of these persons had not yet stabilized and the mid-term benefits would be higher than observed in the experiment.
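The power of 0.645 reported in Section 2.5 comes from the noncentral t-distribution. A rough normal-approximation sketch of the same calculation (my own simplification, not the computation from [8]) lands in the same region:

```java
// Approximate power of a one-sided two-sample t-test via the normal
// approximation: power ≈ Φ(δ − z_(1−α)), with noncentrality
// δ = ES · sqrt(n'/2) and n' the harmonic mean of the group sizes.
// Simplification only; the paper's 0.645 uses the noncentral t.
class PowerSketch {
    // Standard normal CDF via an error-function approximation
    // (Abramowitz–Stegun 7.1.26, accurate to a few 1e-7).
    static double phi(double x) {
        double t = 1.0 / (1.0 + 0.3275911 * Math.abs(x) / Math.sqrt(2));
        double e = 1 - (((((1.061405429 * t - 1.453152027) * t) + 1.421413741)
                * t - 0.284496736) * t + 0.254829592) * t
                * Math.exp(-x * x / 2);
        return x >= 0 ? 0.5 * (1 + e) : 0.5 * (1 - e);
    }

    static double approxPower(double es, int n1, int n2, double zAlpha) {
        double nHarmonic = 2.0 * n1 * n2 / (n1 + n2); // ≈ 9.47 here
        double delta = es * Math.sqrt(nHarmonic / 2.0);
        return phi(delta - zAlpha);
    }

    public static void main(String[] args) {
        // ES = 0.8, n = 10 and 9, one-sided α = 0.1 (z_0.9 ≈ 1.2816).
        System.out.println(approxPower(0.8, 10, 9, 1.2816));
    }
}
```

The approximation gives roughly 0.68; the gap to the reported 0.645 is the usual optimism of the normal approximation at these small sample sizes.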
Furthermore, work conditions different from those found in the experiment may positively or negatively influence the effectiveness of test-first.

3 Results and Discussion

This section investigates the following topics: problem solving time, reliability of the produced results, code reuse, and tester quality. Box plots are used to show the results of the measurements. The filled boxes within a plot contain 50% of the data points. The lower (upper) border of the box marks the 25% (75%) quantile. The left (right) t-bar shows the 10% (90%) quantile. The median is marked with a thick dot. The M with the dashed line marks the mean value within a range of one standard error on each side. The variance of a data distribution is measured as the fraction of the 75% to the 25% quantile. Significance was calculated with the Wilcoxon test, where the significance p denotes the probability that the observed difference is due to chance.
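The Wilcoxon (rank-sum) comparison used above can be sketched as follows; this is an illustrative implementation with the usual normal approximation and no tie correction, since the paper does not say which implementation was used:

```java
// Sketch of the Wilcoxon rank-sum statistic with its normal
// approximation, of the kind used to compare the two groups'
// measurements. Illustrative only.
class WilcoxonSketch {
    // Rank sum of sample a within the pooled data (average ranks for ties).
    static double rankSum(double[] a, double[] b) {
        double w = 0;
        for (double x : a) {
            int below = 0, belowOrEqual = 0;
            for (double v : a) { if (v < x) below++; if (v <= x) belowOrEqual++; }
            for (double v : b) { if (v < x) below++; if (v <= x) belowOrEqual++; }
            w += (below + 1 + belowOrEqual) / 2.0; // average 1-based rank
        }
        return w;
    }

    // Standardized statistic; a one-sided p-value is Phi(z), where a
    // strongly negative z means sample a tends to rank below sample b.
    static double zStatistic(double[] a, double[] b) {
        int n1 = a.length, n2 = b.length;
        double mean = n1 * (n1 + n2 + 1) / 2.0;
        double sd = Math.sqrt(n1 * (double) n2 * (n1 + n2 + 1) / 12.0);
        return (rankSum(a, b) - mean) / sd;
    }
}
```

Being rank-based, the test needs no normality assumption, which suits the small, skewed samples (n = 9 and n = 10) of this experiment.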

3.1 Problem solving time

For the evaluation of the problem solving time, the following times and fractions were compared: the time spent on the whole task (figure 1), the time spent in the implementation phase (figure 2), and the portion of the implementation phase relative to the whole assignment (figure 3).

Figure 1: Overall working time in minutes.

Figure 1 shows only a small difference between both groups. This means the programming effort increases slightly when switching from traditional programming to test-first. But how is the behavior of the two groups prior to the first acceptance-test? Figure 2 shows the time spent in the IP.

Figure 2: Working time in minutes for the implementation phase.

And again, the box plots show only a slight difference between the medians. But when we look at the portion of the IP relative to the whole assignment (see figure 3), we see that the test-first group spent relatively less time in the IP. While there is no statistical evidence (p = 0.158) for this observation, the effect is quite visible. To sum up, the test-first group was not more efficient than the control group, as we had originally expected. But the test-first group tended to spend less time in the IP relative to the overall working time.

3.2 Reliability

In this experiment, reliability was measured as the portion of passed assertions relative to all possible executable assertions in the test. The initial behavior of junit had to be adjusted to

Figure 3: Portion of the implementation phase in percent of the overall working time.

count all failed assertions. That is, junit was modified in such a way that it did not abort a test after a failed assertion. Instead, it continued the test case so that all assertions were executed. The failed assertions were counted and printed out at the end of the test run.

Reliability was measured for two programs: the acceptance-test and a random-test with 727,190 method invocations and about 7.5 million assertions. The reference implementation runs about seven seconds for the acceptance-test and about 150 seconds for the random-test. The random-test calls the methods of the implementation randomly and compares the resulting data structure with the one built by the reference implementation. Deviations in the structure are caught by subsequent assertions. Each method got a weight to ensure that hot methods, such as insert-edge, were called more often than more unusual methods, such as clone-graph.

First, we look at the reliability of the random-test for the final programs, see figure 4.

Figure 4: Reliability of final programs for the random-test in percent.

Only three programs of the control group are less reliable than the test-first group's median of 91%. Even though the median of the control group is smaller than that of the test-first group (84% to 91%, respectively), the observed difference is, with p = 0.2, likely due to chance. The variance of the data distributions differs, with 2.38 for the test-first group compared to 1.00 for the control group. To interpret the difference in the medians as a trend towards a better reliability of test-first is dangerous because of the large variance of the data points. Nevertheless, it is remarkable that five programs of the test-first group achieve a reliability over 96% compared to only one program of the control group.

Now, the programs right after the IP are examined and the question is asked: what would

have happened if the acceptance-test had been omitted? This question is interesting inasmuch as these programs represent the output of the pure test-first process without further modification or enhancement by any external quality control. These programs represent the versions the subjects were most confident of concerning correctness. The reliability of the first run of the acceptance-test is discussed, see figure 5.

Figure 5: Reliability of programs for the acceptance-test after the implementation phase in percent (p = 0.03).

The reliability of the test-first group is significantly lower (p = 0.03) than that of the control group. Except for three programs, all programs in the test-first group are less reliable than the worst one in the control group, and two are even worse than 20%.

Figure 6: Reliability of programs for the random-test after the implementation phase in percent (p = 0.067).

The results for the random-test are quite similar, as shown in figure 6. The lower reliability of the programs of the test-first group is significant with p = 0.067. But what is the reason for this difference? Is it because of the ongoing testing that lulls the developer into a false sense of security? A false sense, because all of the tests run at 100% but do not cover the main faults. Or is it because the subjects of the test-first group are so used to testing that the acceptance-test degenerates to just another test and loses its importance as a control instance? And is it for that reason that the subjects of the control group are more motivated to have fewer faults entering the acceptance-test phase, which consequently leads to more reliable programs? But these reasons are all speculations. So far, we do not know.
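The junit adjustment described in Section 3.2 (continue after a failed assertion and count failures instead of aborting) can be sketched with a hand-rolled checker. This is illustrative only; the actual experiment modified junit itself:

```java
import java.util.*;

// Sketch of an assertion checker that, unlike stock junit of the time,
// does not abort on the first failure: it records every failed check
// and reports passed/total at the end, which is the ratio the paper
// uses as its reliability measure. Illustrative only.
class CountingChecker {
    private int total = 0;
    private int failed = 0;
    private final List<String> messages = new ArrayList<>();

    void check(boolean condition, String message) {
        total++;
        if (!condition) {      // record the failure and keep going
            failed++;
            messages.add(message);
        }
    }

    int passed() { return total - failed; }

    // Reliability as used in the paper: passed / executable assertions.
    double reliability() { return total == 0 ? 1.0 : (double) passed() / total; }

    List<String> failureMessages() { return messages; }
}
```

With a checker like this, a single test run over all 522 acceptance-test assertions (or the 7.5 million random-test assertions) yields the reliability figure directly.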

At this point, it is of no further importance and not surprising that the test-first group has a significantly larger improvement in reliability during the acceptance-test phase than the control group, because it follows quite naturally from the results presented above.

3.3 Code reuse

Examining code reuse might lead to some conclusions about program understanding. Three measures were used to get a perception of it. These are (1) the number of reused methods, (2) the number of failed method calls, and (3) the number of method calls that failed at least twice. The data sets of the last two measures were obtained with silent assertions inserted into the existing graph library. Their output was written to a log file without the subjects' notice. Figures 7 and 8 show the results for the number of reused methods and the number of failed method calls, respectively.

Figure 7: Number of reused methods.

Figure 8: Number of assertions that failed at least once.

Neither figure shows a remarkable difference between the two measures. But there is a difference between the two groups when studying the number of method calls that failed at least twice, see figure 9.

Figure 9: Number of methods that failed more than once (p = 0.086).

The test-first group had significantly fewer errors (p = 0.086) when reusing a method more than once. Combining these two observations into a single result, the authors have to say: test-first does not aid the developer in an initially proper usage of existing code, but it guides him, through ongoing testing, in the process of fixing the fault.

3.4 Tester quality

The quality of the subjects' tester classes is the last studied measure. Branch coverage was used to measure the subjects' tester quality. To get an unbiased comparison, each tester executed the reference implementation and not the subject's implementation. In order to collect this measure, the reference implementation was modified. Each ground block of the reference implementation got a distinct number. On entry of the ground block, this number was written out. The different numbers were collected and summed up for each tester. The reference implementation had a total of 45 branches. This measure was collected at the end of the experiment, after the AP. Figure 10 shows the fraction of the actually executed branches relative to all possible branches of the reference implementation.

Figure 10: Percentages of branches the subjects' testers covered in the reference implementation.

Originally, we assumed that the testers of the test-first group would have better branch coverage than those of the control group. This assumption was based on the hypothesis that the test-first group delivered more reliable programs. But as our hypothesis does not hold to some extent, we do not expect our assumption to hold either. The medians of both groups differ, with 80% for the control group versus 74% for the test-first group. This difference is, with p = 0.451, most likely due to chance. The following observation

is remarkable: despite the fact that 80% of the test-first group's data points are smaller than the median of the control group, the test-first group had a slightly better reliability in their final implementations, see figure 4.

4 Conclusions

This paper presented an experiment about test-first programming conducted at the University of Karlsruhe at the end of the summer lectures 2001. Subjects were CS graduate students who had participated in an Extreme Programming practical training course. The study compared test-first programming with the traditional development process. In particular, it investigated the influence of test-first programming on (1) programming speed, (2) program reliability, and (3) program understanding measured as proper reuse of existing methods. The experiment data led to the following observations.

1. If a developer switches from traditional development to test-first programming, he does not necessarily program faster. That is, he does not arrive at a solution more quickly.

2. Test-first pays off only slightly in terms of increased reliability. In fact, there were five programs developed with test-first with a reliability over 96%, compared to one program in the control group. But this result is blurred by the large variance of the data points. Concentrating on the program versions after the implementation phase, the result just turns around: the test-first group has significantly less reliable programs than the control group. So far, we do not know if this effect is caused by a false sense of security, a lower importance of the acceptance-test for the test-first group, or if it is quite simply a result of too little testing.

3. Test-first programmers learn more quickly to reuse existing methods correctly. This is caused by the ongoing testing strategy of test-first. Once a failure is found, it is indicated by a test case and, while fixing the fault, the developer learns how to use the method or interface correctly.

Despite the observed results, this study is far from being a complete evaluation of test-first programming.
The authors encourage other researchers to repeat the experiment or to conduct a similar one in order to extend the knowledge about test-first.

References

[1] Code unit test first. http://www.c2.com/cgi/wiki?codeunittestfirst.
[2] D. Bisant and J. Lyle. A two-person inspection method to improve programming productivity. IEEE Transactions on Software Engineering, 15(10):1294-1304, October 1989.
[3] L. B. Christensen. Experimental Methodology. Allyn and Bacon, 1994.
[4] A. Cockburn and L. Williams. The costs and benefits of pair programming. In eXtreme Programming and Flexible Processes in Software Engineering, XP2000, Cagliari, Italy, June 2000.
[5] J. Cohen. Statistical Power Analysis for the Behavioral Sciences. Academic Press, 1977.

[6] R. Gittins, S. Hope, and I. Williams. Qualitative studies of XP in a medium sized business. In Proceedings of the 2nd Conference on eXtreme Programming and Flexible Processes in Software Engineering, Cagliari, Italy, May 2001.
[7] David Goldschmidt. Design and implementation of a generic graph container in Java. Master's thesis, Rensselaer Polytechnic Institute, Troy, New York, April 1998.
[8] R. Ihaka and R. Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3):299-314, 1996.
[9] junit.org. http://www.junit.org/.
[10] M. Müller and W. Tichy. Case study: Extreme programming in a university environment. In Proceedings of the 23rd International Conference on Software Engineering, pages 537-544, Toronto, Canada, May 2001.
[11] J. Nosek. The case for collaborative programming. Communications of the ACM, 41(3):105-108, March 1998.
[12] L. Williams, R. Kessler, W. Cunningham, and R. Jeffries. Strengthening the case for pair programming. IEEE Software, pages 19-25, July/August 2000.