Twin Case Study: Treatment for Articulation Disabilities

Sirius Qin and Jun Chen
November 5, 2010
Department of Statistics, University of British Columbia

For Angela Feehan, M.Sc. student, Audiology and Speech Sciences, University of British Columbia

1 Introduction

Articulation disabilities are difficulties that some children with speech impediments have in forming sounds and syllables correctly. Children with articulation disabilities may make several kinds of mistakes, such as substituting one sound for another, omitting sounds, or pronouncing three syllables for a word that has only two. In general, the most common error sounds are [r], [l], [sp] and [st]. A good time to detect these disabilities is at around age 6, since by this age children have usually learned all the sounds, and their difficulties in pronouncing certain sounds begin to appear.

2 Experiment Conduct and Data Description

A single pair of six-year-old twin boys with articulation disabilities took part in this experiment. Their articulation abilities were measured at the beginning of the experiment (assessment 1). After four months, which fell over the summer, they were assessed a second time (assessment 2). Up to this point, the two boys had not received any active treatment. These baseline measurements were intended to detect natural improvement in the two boys' pronunciation ability. Twin A was then given treatment 1, in which a teacher taught him English grammar for 4 months. Over the same period, Twin B was given treatment 2, in which another teacher taught him how to pronounce the hard-to-articulate sounds. At the end of these treatments, both twins' articulation abilities were assessed a third time (assessment 3). Twin A was then given treatment 2 and Twin B was given treatment 1 for another four months (each treatment was given by the same teacher as before). After that, the twins were assessed a fourth time (assessment 4).

There were four assessments for each boy, and all 8 assessments were identical. In each assessment, the boy was asked to read the same word list consisting of 86 words. Of those 86 different words, 22 were used for assessment. These words were intended to test 5 specific criteria:

- whether the sound [r] can be articulated. Test words are ribbon, rabbit, etc.
- whether the sound [l] can be articulated. Test words are whistle, pickle, etc.
- whether the sounds [st,sp] can be articulated. Test words are star, spaceship, etc.

- whether the boy can articulate words with a particular shape, for example bridge, which has the structure CCVCVC, where C stands for consonant and V for vowel.
- whether the boy can articulate a three-syllable word. Test words are computer, gorilla, etc.

Among the 22 words, the word spool was used to test both [st,sp] and [l]. Every other word was used to test only one criterion. So, if we treat spool as two test items, one testing [st,sp] and one testing [l], then there are 23 (21 + 1 + 1) test items in all. The test words were distributed throughout the word list.

When the boy read one of the test words other than spool, the result of his pronunciation was coded as either 1 (if he satisfied the criterion that the word tests) or 0 (if he did not). When the boy read spool, there were two results: whether he pronounced [sp] correctly (1 or 0), and whether he pronounced [l] correctly (1 or 0). So, in every assessment, there were 23 results in total. Figure 1 shows the proportion of correctly pronounced results in each assessment and how it changed over the assessments.

Figure 1: The proportion of test items pronounced correctly

3 Statistical Questions

Since there are only two subjects in this experiment, any statistical conclusions relate only to these two boys rather than to a general population. The questions of interest are:

- How can the results of any two assessments on the same boy be compared? Is the change between the two results statistically significant?
- How can the magnitude of the treatment effects be estimated appropriately? Are these treatment effects significant?
- What is the proper way to compare the effects of the two treatments?

4 Statistical Suggestions

4.1 Basic Assumptions

First of all, you may be willing to regard the result of every test item in every assessment for every boy as random rather than deterministic. This means that, for each test item, even though the boy has articulation problems, he would not always pronounce the item incorrectly. Instead, he has a certain chance of pronouncing it correctly. For example, if the boy has difficulty articulating [r] and is asked to read ribbon 10 times, he may end up with 6 correct and 4 incorrect, or with 2 correct and 8 incorrect, or some other result. In your experiment, no test item was read more than once in one assessment. From the above point of view, however, when a test item was read correctly or incorrectly in one assessment, it was because the boy was lucky or unlucky on that attempt. Of course, how likely the boy is to be lucky depends on his articulation ability.

4.1.1 Overall average probability

If you believe that the result of every test item in every assessment for every boy is random, then you might imagine that a particular boy, when he reads a test item, has a certain probability of pronouncing it correctly. For example, when this boy is asked to say rabbit, he may have a 60% chance of saying it correctly. However, this probability may be different for

different items, and may be different for the same item in different assessments.

Let's first focus on one assessment for one boy only, and consider the other assessments and the other boy later. For instance, let's look at assessment 1 for Twin A. Among your test items, some test [r], some test [l], and so on. So, in assessment 1, Twin A may have a 60% chance of pronouncing [r]-items correctly, but only a 45% chance of pronouncing [l]-items correctly. More generally, even two [r]-items may have different chances of being articulated correctly: the probability of pronouncing rabbit correctly and the probability of pronouncing ribbon correctly may be different.

However, you might not be very interested in the probability associated with a specific test item or a specific criterion. Rather, you might be interested in the average probability of Twin A articulating an arbitrary test item. In that sense, you could use an overall average probability (denoted by $P_1$, where the subscript 1 denotes assessment 1) to describe the chance of Twin A articulating an arbitrary test item. This means that, in assessment 1, Twin A has a chance of $P_1$ of pronouncing a test item correctly, no matter which item it is or which kind of item it is. This $P_1$ represents the articulation ability of Twin A at assessment 1.

You may think of this perspective as an assumption, but you could also think of it simply as a different way of viewing the boy's articulation ability. In fact, we do not have to require Twin A to have an equal chance of articulating different test items; we simply do not look into the subcategories. For example, a boy might be asked to read 90 words, 30 of which are [r]-items, 30 of which are [l]-items, and 30 of which are [st]-items. As a result, this boy may correctly articulate 10 [r]-items, 20 [l]-items, and all 30 [st]-items. In this case, it may be hard to convince people that the boy has an equal chance of pronouncing the different kinds of items correctly. However, it is still meaningful to say that this boy has about a 67% chance (60 correct out of 90) of articulating each item, if we understand that this figure represents his overall average probability.

Let $X_j$ be the result of test item $j$ in assessment 1 for Twin A, where $j = 1, 2, \ldots, 23$. Then $X_j$ was either 1 (if Twin A articulated this item correctly) or 0 (if Twin A didn't). Making use

of the overall average probability $P_1$, we have:

$$X_j = \begin{cases} 1 & \text{with probability } P_1, \\ 0 & \text{with probability } 1 - P_1. \end{cases}$$

4.1.2 Independence

In general, when a boy is asked to read a series of words, the result of reading previous words (correctly or incorrectly) may influence his performance on the present word. For example, when a boy is asked to read two words consecutively, he may be more likely to articulate the second word correctly if he pronounced the first word correctly. In contrast, he may be more likely to fail on the second word if he did not pronounce the first one correctly.

Let's still focus on assessment 1 for Twin A. The above statements suggest that the results of the test items from assessment 1 for Twin A might be related. In that case, it becomes much harder to describe the overall average probability $P_1$ using the data from assessment 1 for Twin A, since we are unlikely to know in which way the results of the test items are related.

However, in your experiment the test words were distributed throughout the word list. So, after Twin A read the first test word, he did not read the second test word immediately, but read some non-test words first. Furthermore, there was always a break between any two words (15 seconds, 20 seconds, etc.). Usually, the relationship between results becomes weaker as time elapses. So, from this point of view, you might be willing to assume that the results on the different test items did not influence each other. Statistically speaking, this corresponds to an assumption that the boy's responses on the different test items are independent.

As the result on every test item ($X_j$) provides information on the overall average probability $P_1$, it is meaningful to combine the information from all the $X_j$'s. Let $\hat{P}_1$ be the average of all the $X_j$'s:

$$\hat{P}_1 = \frac{X_1 + X_2 + \cdots + X_{23}}{23}.$$

Then $\hat{P}_1$ is simply the proportion of correctly articulated test items in assessment 1 for Twin A, which is a natural estimator of $P_1$. As illustrated in Figure 1, we have $\hat{P}_1 = 7/23 \approx 0.304$.
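The estimator above is just the mean of the 0/1 results. A minimal sketch follows, assuming a hypothetical ordering of the coded results: only the totals (7 correct out of 23) come from the report, and the function name `proportion_correct` is illustrative.

```python
# Hypothetical ordering of the 0/1 results for Twin A, assessment 1.
# Only the totals (7 correct out of 23 items) come from the report.
results = [1] * 7 + [0] * 16


def proportion_correct(coded_results):
    """Return the share of test items coded 1 (pronounced correctly)."""
    return sum(coded_results) / len(coded_results)


p_hat_1 = proportion_correct(results)
print(round(p_hat_1, 3))  # 0.304
```

Because the estimate depends only on the number of correct items, the order in which the items appear in the list does not matter here.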

4.1.3 Other assessments

Similar to $P_1$, the overall average probability used to describe the articulation ability of Twin A in assessment 1, the articulation abilities of Twin A in assessments 2, 3 and 4 can also be described by overall average probabilities, $P_2$, $P_3$ and $P_4$. These overall average probabilities describe how the articulation ability of Twin A changed over the 4 assessments.

All 4 assessments were the same: the boy was asked to read the same word list, and the test items were the same at every assessment. So, as in assessment 1, Twin A also produced 23 results on these test items in every other assessment (assessments 2, 3 and 4). The proportions of correct responses on the subsequent assessments are $\hat{P}_2$, $\hat{P}_3$ and $\hat{P}_4$. As illustrated in Figure 1, we have $\hat{P}_2 = 5/23 \approx 0.217$, $\hat{P}_3 = 8/23 \approx 0.348$ and $\hat{P}_4 = 19/23 \approx 0.826$; together with $\hat{P}_1$, these are the natural estimators of $P_1$, $P_2$, $P_3$ and $P_4$.

You may notice that $\hat{P}_2 = 5/23 < \hat{P}_1 = 7/23$, but this does not justify a claim that $P_2 < P_1$. $\hat{P}_2$ is just the proportion of correct articulations by Twin A in assessment 2, an estimate of $P_2$ computed from one set of random results, rather than the value of $P_2$ itself; similarly for $\hat{P}_1$. That is, even if $P_2 > P_1$ holds, the observed proportions may still be ordered the other way.

For Twin B, everything is analogous. At each assessment, the articulation ability of Twin B can be described by his overall average probability: $P_1^*$, $P_2^*$, $P_3^*$ and $P_4^*$. Here, the star (*) means that these probabilities are for Twin B rather than Twin A. These overall average probabilities tell us how the articulation ability of Twin B changed over his 4 assessments. The assessments for Twin B are identical to those for Twin A. If we denote the proportions of correctly articulated responses for Twin B by $\hat{P}_1^*$, $\hat{P}_2^*$, $\hat{P}_3^*$ and $\hat{P}_4^*$, we have $\hat{P}_1^* = 11/23 \approx 0.478$, $\hat{P}_2^* = 11/23 \approx 0.478$, $\hat{P}_3^* = 18/23 \approx 0.783$ and $\hat{P}_4^* = 18/23 \approx 0.783$; these are the natural estimators of $P_1^*$, $P_2^*$, $P_3^*$ and $P_4^*$.

In the last subsection, we mentioned the assumption that the boy's responses on different test items in one assessment are independent. That assumption underlies everything in this report. Here, we make an additional independence assumption: that the results on different assessments are independent. This is a natural assumption. For a particular boy, the time between consecutive assessments is 4 months, so it seems reasonable to assume that the results on the previous

assessment do not influence the results on the next assessment, nor the results on any subsequent assessment. Furthermore, the results of any assessment on one boy should not influence the results of any assessment on the other boy. This independence assumption implies that all the $\hat{P}_i$'s and $\hat{P}_i^*$'s are independent.

In conclusion, for any particular boy in any particular assessment, you could use one overall average probability to describe his articulation ability, and you could use the proportion of correct results on the test items from that assessment to estimate that probability. Figure 2 shows the whole picture of the different overall average probabilities and their corresponding estimators, the proportions of correct responses.

Figure 2: Overall average probabilities and the proportions of correct responses

4.2 Comparing the results from two assessments on one Twin

In your summary, you asked how to determine whether the change seen after treatment 1 is significant for Twin A. From the experimental data, we observe the result of assessment 2 for Twin A, which is $\hat{P}_2$ (before treatment 1), and the result of assessment 3 for Twin A, which is $\hat{P}_3$ (after treatment 1). So the question becomes: does the observed difference between $\hat{P}_2$ and $\hat{P}_3$ justify a claim that the articulation abilities of Twin A at these two assessments are different ($P_2 \neq P_3$)?

The results obtained on the assessments are random, so even if $P_2$ and $P_3$ are the same, which means that the articulation abilities of Twin A are the same at assessment 2 and assessment 3, the observed $\hat{P}_2$ and $\hat{P}_3$ may still be different. On the other hand, if the observed $\hat{P}_2$ and $\hat{P}_3$ are

different, $P_2$ and $P_3$ may be the same. From this perspective, we would want the difference between $\hat{P}_2$ and $\hat{P}_3$ to be large before claiming with confidence that $P_2$ and $P_3$ are different.

A confidence interval for $P_3 - P_2$, which is a range of plausible values for $P_3 - P_2$, can be used to assess whether the difference between $\hat{P}_2$ and $\hat{P}_3$ ($\hat{P}_3 - \hat{P}_2 \approx 0.348 - 0.217 = 0.131$) is large enough to provide convincing evidence that $P_2$ and $P_3$ are different.

To construct a confidence interval for the difference $P_3 - P_2$, we first need to know the standard error of $\hat{P}_2$ and the standard error of $\hat{P}_3$. The standard error of $\hat{P}_2$ tells us how precise $\hat{P}_2$ is as an estimator of $P_2$. Under the assumption that the boy's responses on the different test items in assessment 2 are independent, the standard error (SE) of $\hat{P}_2$ is given by the formula:

$$\mathrm{SE}(\hat{P}_2) = \sqrt{\frac{\hat{P}_2\,(1 - \hat{P}_2)}{23}}, \tag{1}$$

which yields $\mathrm{SE}(\hat{P}_2) = \sqrt{\tfrac{5}{23}\left(1 - \tfrac{5}{23}\right)/23} \approx 0.086$. Similarly, $\mathrm{SE}(\hat{P}_3) = \sqrt{\tfrac{8}{23}\left(1 - \tfrac{8}{23}\right)/23} \approx 0.099$.

Furthermore, we need to know the standard error of $\hat{P}_3 - \hat{P}_2$, which tells us how precise this difference is as an estimator of $P_3 - P_2$. Under the assumption that the boy's results at the different assessments are independent, this standard error is given by:

$$\mathrm{SE}(\hat{P}_3 - \hat{P}_2) = \sqrt{\mathrm{SE}(\hat{P}_2)^2 + \mathrm{SE}(\hat{P}_3)^2}, \tag{2}$$

which yields $\mathrm{SE}(\hat{P}_3 - \hat{P}_2) \approx \sqrt{0.086^2 + 0.099^2} \approx 0.131$. The 95% approximate confidence interval for $P_3 - P_2$ is then given by:

$$(\hat{P}_3 - \hat{P}_2) \pm 1.96 \times \mathrm{SE}(\hat{P}_3 - \hat{P}_2),$$

which yields $0.131 \pm 1.96 \times 0.131$, or in this case the interval $[-0.126, 0.388]$.

If the confidence interval contains 0, which is true in this case, then $\hat{P}_3 - \hat{P}_2$, the change seen after treatment 1 for Twin A, is not significantly different from 0. In other words, the data do not provide convincing evidence that the articulation abilities $P_2$ and $P_3$ are different. If the confidence interval had not contained 0, then the change seen after treatment 1 for Twin A would have been significantly different from 0, indicating that the articulation abilities $P_2$ and $P_3$ are different.
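The confidence-interval recipe for a change between two assessments can be sketched in a few lines of Python. This is only an illustration under the report's independence assumptions; the function names `se_proportion` and `diff_ci` are hypothetical, and the inputs are the counts of correct items (5 and 8 out of 23) reported for Twin A at assessments 2 and 3.

```python
import math

N_ITEMS = 23  # test items per assessment, as described in the report


def se_proportion(correct, n=N_ITEMS):
    """Standard error of a proportion: sqrt(p * (1 - p) / n)."""
    p = correct / n
    return math.sqrt(p * (1 - p) / n)


def diff_ci(correct_before, correct_after, n=N_ITEMS, z=1.96):
    """Estimate, SE and approximate CI for the change between two assessments."""
    diff = (correct_after - correct_before) / n
    # Independent assessments: the variances (squared SEs) add.
    se = math.sqrt(se_proportion(correct_before, n) ** 2
                   + se_proportion(correct_after, n) ** 2)
    return diff, se, (diff - z * se, diff + z * se)


diff, se, (low, high) = diff_ci(5, 8)
print(f"change = {diff:.3f}, SE = {se:.3f}, CI = [{low:.3f}, {high:.3f}]")
print("contains 0:", low <= 0 <= high)
```

The code keeps full precision in the intermediate steps, so an endpoint may differ from the hand calculation in the third decimal place. The conclusion is unchanged: the interval contains 0.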
Similarly, the time period between assessment 1 and assessment 2 was intended to detect whether there is a natural improvement in Twin A's articulation ability. The above confidence interval method

is one way to address that question: construct a confidence interval for $P_2 - P_1$ and see whether this interval covers 0. If it does, you may conclude that there is no evidence of such natural improvement in Twin A, since the change between assessment 1 and assessment 2 for Twin A is not significantly different from 0.

We already have $\hat{P}_1 \approx 0.304$ and $\hat{P}_2 \approx 0.217$, so $\hat{P}_2 - \hat{P}_1 \approx 0.217 - 0.304 = -0.087$. As in equation (1), we have $\mathrm{SE}(\hat{P}_1) = \sqrt{0.304\,(1 - 0.304)/23} \approx 0.096$. So, from equation (2), we have $\mathrm{SE}(\hat{P}_2 - \hat{P}_1) \approx \sqrt{0.086^2 + 0.096^2} \approx 0.129$. The 95% approximate confidence interval for $P_2 - P_1$ is given by:

$$(\hat{P}_2 - \hat{P}_1) \pm 1.96 \times \mathrm{SE}(\hat{P}_2 - \hat{P}_1),$$

which yields $-0.087 \pm 1.96 \times 0.129$, or in this case the interval $[-0.340, 0.166]$. This 95% approximate confidence interval for $P_2 - P_1$ contains 0, so the data provide no evidence of natural improvement for Twin A based on the results from his first two assessments.

If you want to compare the results from any two assessments for Twin B, you could also make use of this confidence interval method.

4.3 Assessing the treatment effects

By using the method from the last section, you can assess whether the articulation abilities of either boy at two assessments are different. A further question, probably of greater interest, is: can we attribute such a difference to the effects of the treatments? For example, suppose that the articulation ability of Twin A in assessment 3 is higher than his articulation ability in assessment 2 ($P_3 > P_2$). Can we then claim that this improvement is the result of treatment 1?

4.3.1 Stable natural improvement assumption

In your experiment, the two boys were given treatment every one or two weeks on the weekend. On weekdays, these two boys were just like other 6-year-old boys. They met their friends, went to class, and so on. They talked with others every day. So it seems natural to expect that their articulation abilities would improve during these daily activities.

In your experiment, no treatment was given during the time period between assessment 1 and assessment 2. So, this period was intended to detect the effects of the possible natural improvements of Twin A and

Twin B. Take Twin A for example. Suppose there was indeed a natural improvement; then the difference between the articulation abilities of Twin A in assessment 1 and in assessment 2 ($P_2 - P_1$) should be attributed to that natural improvement. Then, if Twin A's articulation abilities in assessment 3 ($P_3$) and in assessment 2 ($P_2$) are different, we should attribute this change partially to the effect of treatment 1 and partially to his natural improvement! (If one expects a natural improvement in one time period, then one would also expect natural improvement in the other time periods.) Similarly, if Twin A's articulation abilities in assessment 4 ($P_4$) and in assessment 3 ($P_3$) are different, we should also partially attribute that difference to his natural improvement.

Now, you might be willing to assume that the natural improvement of Twin A is stable; that is, if there were no treatment, the change in Twin A's articulation ability in the 2nd and 3rd four-month time periods would be the same as the change in the 1st four-month time period. You might further be willing to assume that when Twin A received a treatment, as is the case in the 2nd and 3rd time periods, he still had the same natural improvement as if he had not received any treatment. If you think these assumptions are reasonable, then we will be able to decompose the change in Twin A's articulation ability into the effect of treatment and the effect of his natural improvement, and then estimate the effects of both treatments. We could proceed similarly for Twin B.

Interestingly, Figure 1 provides no suggestion that there is improvement in either twin from assessment 1 to assessment 2, since $\hat{P}_1 > \hat{P}_2$ and $\hat{P}_1^* = \hat{P}_2^*$. However, as we mentioned in Section 4.1.3 and at the very beginning of Section 4.1, the result of each assessment is random rather than deterministic, so $\hat{P}_i$ and $\hat{P}_i^*$ are just observed values that estimate the twins' true articulation abilities.
Besides, from the discussion in this subsection, you might be willing to think that natural improvement exists in all the time periods. So, the method of estimating the treatment effects, which ideally would have been decided prior to seeing the data, should allow for this possibility.

4.3.2 Estimating the treatment effects

If you are willing to assume that the natural improvements of the twins are stable, then you could estimate the treatment effects in the following way.

First, we focus on Twin A. Let $T_A$ be the natural improvement of Twin A. Then $T_A$ caused the change in Twin A's articulation ability from assessment 1 to assessment 2, $P_2 - P_1$ (see Figure 2). So we have:

$$T_A = P_2 - P_1. \tag{3}$$

The change in Twin A's articulation ability from assessment 2 to assessment 3, $P_3 - P_2$, can then be expressed as $T_A$ plus the effect of treatment 1 on Twin A. So, in order to assess the effect of treatment 1 on Twin A, we need to subtract $T_A$ from $P_3 - P_2$. If $T_{1,A}$ denotes the effect of treatment 1 on Twin A, we have:

$$T_{1,A} = (P_3 - P_2) - T_A. \tag{4}$$

Combining (3) and (4), we have:

$$T_{1,A} = (P_3 - P_2) - (P_2 - P_1) = P_3 - 2P_2 + P_1. \tag{5}$$

From previous sections, we know that $\hat{P}_1$, $\hat{P}_2$ and $\hat{P}_3$ are natural estimators of $P_1$, $P_2$ and $P_3$ respectively. So we can get the natural estimators of $T_A$ and $T_{1,A}$, which are:

$$\hat{T}_A = \hat{P}_2 - \hat{P}_1 = \frac{5}{23} - \frac{7}{23} = -\frac{2}{23} \approx -0.087, \tag{6}$$

$$\hat{T}_{1,A} = \hat{P}_3 - 2\hat{P}_2 + \hat{P}_1 = \frac{8}{23} - 2 \times \frac{5}{23} + \frac{7}{23} = \frac{5}{23} \approx 0.217. \tag{7}$$

Note that in Section 4.2, we already discussed the difference in Twin A's articulation ability between assessment 1 and assessment 2. In Section 4.2, we checked whether there is a natural improvement in Twin A by looking at this difference. Now, we estimate Twin A's natural improvement as this difference.

The cause of the change in Twin A's articulation ability from assessment 3 to assessment 4, $P_4 - P_3$, is a little more complicated. Apart from the effect of the natural improvement $T_A$ in that period, Twin A also received treatment 2. However, treatment 2 was applied after treatment 1, and we are not sure whether there is a carry-over effect of treatment 1 on treatment 2. This means that, if Twin A had not received treatment 1 before, the effect of treatment 2 might have been different. So, to be careful, we had better say that the change in Twin A's articulation ability from assessment 3 to assessment 4, $P_4 - P_3$, is due to his natural improvement plus the effect of treatment 2 after treatment 1 on Twin A. (Not the pure effect of treatment 2 on Twin A, as we can't be sure that is the same thing!) If $T_{2|1,A}$ denotes the effect of treatment 2 after treatment 1 on Twin A, then we have:

$$T_{2|1,A} = (P_4 - P_3) - T_A. \tag{8}$$
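Estimates (6) and (7) can be checked with exact fractions. The sketch below is illustrative only: the dictionary layout and variable names are assumptions, while the counts of correct items (7, 5, 8 and 19 out of 23) are the ones reported for Twin A.

```python
from fractions import Fraction

N_ITEMS = 23  # test items per assessment

# Counts of correct items reported for Twin A at assessments 1 to 4.
correct_twin_a = {1: 7, 2: 5, 3: 8, 4: 19}
p_hat = {k: Fraction(v, N_ITEMS) for k, v in correct_twin_a.items()}

# Equation (6): estimated natural improvement, P2_hat - P1_hat.
natural_improvement = p_hat[2] - p_hat[1]

# Equation (7): estimated effect of treatment 1, P3_hat - 2*P2_hat + P1_hat.
treatment_1_effect = p_hat[3] - 2 * p_hat[2] + p_hat[1]

print(natural_improvement, treatment_1_effect)  # -2/23 5/23
```

Working with `Fraction` avoids rounding in the intermediate steps; converting with `float()` gives the decimal values quoted in the text.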

Combining (3) and (8), we have:

$$T_{2|1,A} = P_4 - P_3 - P_2 + P_1. \tag{9}$$

Since $\hat{P}_4$ is a natural estimator of $P_4$, we can get the natural estimator of $T_{2|1,A}$:

$$\hat{T}_{2|1,A} = \hat{P}_4 - \hat{P}_3 - \hat{P}_2 + \hat{P}_1 = \frac{19}{23} - \frac{8}{23} - \frac{5}{23} + \frac{7}{23} = \frac{13}{23} \approx 0.565. \tag{10}$$

The situation for Twin B is similar. Let $T_B$ be the natural improvement of Twin B, $T_{2,B}$ be the effect of treatment 2 on Twin B, and $T_{1|2,B}$ be the effect of treatment 1 after treatment 2 on Twin B. Then we have (see Figure 2):

$$T_B = P_2^* - P_1^*, \qquad \hat{T}_B = \hat{P}_2^* - \hat{P}_1^*, \tag{11}$$

$$T_{2,B} = (P_3^* - P_2^*) - T_B, \qquad \hat{T}_{2,B} = (\hat{P}_3^* - \hat{P}_2^*) - \hat{T}_B = \hat{P}_3^* - 2\hat{P}_2^* + \hat{P}_1^*, \tag{12}$$

$$T_{1|2,B} = (P_4^* - P_3^*) - T_B, \qquad \hat{T}_{1|2,B} = (\hat{P}_4^* - \hat{P}_3^*) - \hat{T}_B = \hat{P}_4^* - \hat{P}_3^* - \hat{P}_2^* + \hat{P}_1^*. \tag{13}$$

Substituting in the values of $\hat{P}_1^*$, $\hat{P}_2^*$, $\hat{P}_3^*$ and $\hat{P}_4^*$, we get $\hat{T}_B = 0$, $\hat{T}_{2,B} = 7/23 \approx 0.304$ and $\hat{T}_{1|2,B} = 0$.

Figure 3 re-expresses Figure 2 to show the whole picture of your experiment and how the twins' articulation abilities changed according to the natural improvements and the treatment effects, under the assumption of stable natural improvement for both twins. As already noted, this way of viewing the results of your experiment ideally would have been settled upon prior to seeing the data illustrated in Figure 1.

4.3.3 Confidence intervals for the treatment effects

Now we have obtained estimates of the effects of the treatments and of the natural improvements. How can we decide whether those effects are significant?

In Section 4.2, we described the use of a confidence interval to judge whether the change seen between any two assessments is significant, and illustrated how to determine whether the natural improvements are significant. The same approach can be used to answer the question of whether the effects of

treatments are significant.

Figure 3: How the overall average probabilities change with stable natural improvement

Consider the effect of treatment 1 on Twin A as an example. We have $\hat{T}_{1,A}$ as an estimate of $T_{1,A}$. To assess the significance of the effect of treatment 1 on Twin A is to judge whether $\hat{T}_{1,A}$ is large enough. If so, we may be confident in saying that $T_{1,A}$ is larger than 0; that is, the data provide convincing evidence that treatment 1 did have an effect on Twin A.

To judge whether $\hat{T}_{1,A}$ is large enough, we need to construct a confidence interval for $T_{1,A}$. Thus, we need to know the standard error of $\hat{T}_{1,A}$, which tells us how precise $\hat{T}_{1,A}$ is as an estimator of $T_{1,A}$. From (7), we have:

$$\hat{T}_{1,A} = \hat{P}_3 - 2\hat{P}_2 + \hat{P}_1 = \frac{5}{23} \approx 0.217.$$

Recall that, in Section 4.2, we have the formula for the standard error of each individual $\hat{P}_i$:

$$\mathrm{SE}(\hat{P}_i) = \sqrt{\frac{\hat{P}_i\,(1 - \hat{P}_i)}{23}}. \tag{14}$$

Then, how can we compute the standard error of $\hat{T}_{1,A}$? Note that $\hat{T}_{1,A}$ is a linear combination of the $\hat{P}_i$'s; that is, $\hat{T}_{1,A}$ can be expressed as $\sum_{i=1}^{4} c_i \hat{P}_i$, where the $c_i$'s are known constants. From (7), for $\hat{T}_{1,A}$, we have $c_1 = 1$, $c_2 = -2$, $c_3 = 1$ and $c_4 = 0$.

Under the assumption that the results from different assessments on the same boy are independent, the standard error of any linear combination of the $\hat{P}_i$'s can be expressed in terms of the standard errors of the individual $\hat{P}_i$'s. In particular:

$$\mathrm{SE}\!\left(\sum_{i=1}^{4} c_i \hat{P}_i\right) = \sqrt{\sum_{i=1}^{4} c_i^2\, \mathrm{SE}(\hat{P}_i)^2}. \tag{15}$$
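For readers who want to see where formula (15) comes from, here is a short derivation sketch. It relies only on the independence assumption already stated in the report and on the standard fact that the variance of a sum of independent quantities is the sum of their variances.

```latex
% Independence of the \hat{P}_i's means all covariance terms vanish, so
\mathrm{Var}\!\left(\sum_{i=1}^{4} c_i \hat{P}_i\right)
  = \sum_{i=1}^{4} c_i^2 \,\mathrm{Var}(\hat{P}_i).
% The standard error is the square root of the (estimated) variance:
\mathrm{SE}\!\left(\sum_{i=1}^{4} c_i \hat{P}_i\right)
  = \sqrt{\sum_{i=1}^{4} c_i^2 \,\mathrm{SE}(\hat{P}_i)^2}.
% With c_1 = 1, c_2 = -2, c_3 = 1, c_4 = 0 (the effect of treatment 1 on Twin A):
\mathrm{SE}(\hat{T}_{1,A})
  = \sqrt{\mathrm{SE}(\hat{P}_1)^2 + 4\,\mathrm{SE}(\hat{P}_2)^2 + \mathrm{SE}(\hat{P}_3)^2}.
```

Note that the coefficient $-2$ enters squared, which is why the term for assessment 2 carries a factor of 4.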

According to (7), the standard error of $\hat{T}_{1,A}$ can be expressed as follows:

$$\mathrm{SE}(\hat{T}_{1,A}) = \sqrt{\mathrm{SE}(\hat{P}_3)^2 + 4\,\mathrm{SE}(\hat{P}_2)^2 + \mathrm{SE}(\hat{P}_1)^2} = \sqrt{\frac{\tfrac{8}{23}\left(1 - \tfrac{8}{23}\right)}{23} + 4 \times \frac{\tfrac{5}{23}\left(1 - \tfrac{5}{23}\right)}{23} + \frac{\tfrac{7}{23}\left(1 - \tfrac{7}{23}\right)}{23}} \approx 0.221. \tag{16}$$

Then, as before, the 95% approximate confidence interval for $T_{1,A}$ is given by:

$$\hat{T}_{1,A} \pm 1.96 \times \mathrm{SE}(\hat{T}_{1,A}) \approx 0.217 \pm 1.96 \times 0.221 \approx [-0.215, 0.650].$$

Since the 95% approximate confidence interval for $T_{1,A}$ contains 0, the effect of treatment 1 on Twin A is not significantly different from 0.

If you would like to see whether the effect of treatment 2 on Twin B is significant, you could construct a confidence interval for $T_{2,B}$ in exactly the same fashion, and see whether that interval contains 0. From (12), we have:

$$\hat{T}_{2,B} = (\hat{P}_3^* - \hat{P}_2^*) - \hat{T}_B = \hat{P}_3^* - 2\hat{P}_2^* + \hat{P}_1^* \approx 0.304.$$

As for $\hat{T}_{1,A}$, the standard error of $\hat{T}_{2,B}$ can be expressed as:

$$\mathrm{SE}(\hat{T}_{2,B}) = \sqrt{\mathrm{SE}(\hat{P}_3^*)^2 + 4\,\mathrm{SE}(\hat{P}_2^*)^2 + \mathrm{SE}(\hat{P}_1^*)^2} = \sqrt{\frac{\tfrac{18}{23}\left(1 - \tfrac{18}{23}\right)}{23} + 4 \times \frac{\tfrac{11}{23}\left(1 - \tfrac{11}{23}\right)}{23} + \frac{\tfrac{11}{23}\left(1 - \tfrac{11}{23}\right)}{23}} \approx 0.248. \tag{17}$$

Then, the 95% approximate confidence interval for $T_{2,B}$ is given by:

$$\hat{T}_{2,B} \pm 1.96 \times \mathrm{SE}(\hat{T}_{2,B}) \approx 0.304 \pm 1.96 \times 0.248 \approx [-0.182, 0.791].$$

Since the 95% approximate confidence interval for $T_{2,B}$ also contains 0, the effect of treatment 2 on Twin B is not significantly different from 0.

You might also be interested in $T_{2|1,A}$, the effect of treatment 2 after treatment 1 on Twin A, and $T_{1|2,B}$, the effect of treatment 1 after treatment 2 on Twin B. From (10) and (13), those two estimates are also linear combinations of the $\hat{P}_i$'s and the $\hat{P}_i^*$'s, so you could calculate their standard errors by (15), and then construct the 95% confidence intervals and see whether the intervals contain 0.
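The calculations in (16) and (17) follow one pattern: choose coefficients for the four assessments, then apply (14) and (15). A hedged sketch of that pattern is given below; the function name `linear_combo_ci` and the list layout are illustrative choices, and the counts are the numbers of correct items reported for each twin.

```python
import math

N_ITEMS = 23  # test items per assessment


def linear_combo_ci(correct_counts, coeffs, n=N_ITEMS, z=1.96):
    """Estimate, SE and approximate CI for sum(c_i * P_i_hat).

    Assumes the assessments are independent, so the variance of the
    combination is sum(c_i**2 * p_i * (1 - p_i) / n).
    """
    props = [k / n for k in correct_counts]
    estimate = sum(c * p for c, p in zip(coeffs, props))
    variance = sum(c ** 2 * p * (1 - p) / n for c, p in zip(coeffs, props))
    se = math.sqrt(variance)
    return estimate, se, (estimate - z * se, estimate + z * se)


twin_a = [7, 5, 8, 19]     # correct items, assessments 1 to 4
twin_b = [11, 11, 18, 18]

# Coefficients (1, -2, 1, 0): change from 2 to 3 minus change from 1 to 2.
est_a, se_a, ci_a = linear_combo_ci(twin_a, [1, -2, 1, 0])  # treatment 1 on Twin A
est_b, se_b, ci_b = linear_combo_ci(twin_b, [1, -2, 1, 0])  # treatment 2 on Twin B

for label, est, se, ci in [("Twin A", est_a, se_a, ci_a), ("Twin B", est_b, se_b, ci_b)]:
    print(f"{label}: estimate={est:.3f}, SE={se:.3f}, CI=[{ci[0]:.3f}, {ci[1]:.3f}]")
```

The same function handles the period between assessments 3 and 4 by passing the coefficients `[1, -1, -1, 1]`, which correspond to the combinations in (10) and (13).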

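In fact, from (10) and (13), we have:

$$\hat{T}_{2|1,A} = \hat{P}_4 - \hat{P}_3 - \hat{P}_2 + \hat{P}_1 = \frac{13}{23} \approx 0.565, \qquad \hat{T}_{1|2,B} = \hat{P}_4^* - \hat{P}_3^* - \hat{P}_2^* + \hat{P}_1^* = 0.$$

Then, the standard errors of $\hat{T}_{2|1,A}$ and $\hat{T}_{1|2,B}$ are:

$$\mathrm{SE}(\hat{T}_{2|1,A}) = \sqrt{\mathrm{SE}(\hat{P}_1)^2 + \mathrm{SE}(\hat{P}_2)^2 + \mathrm{SE}(\hat{P}_3)^2 + \mathrm{SE}(\hat{P}_4)^2} = \sqrt{\frac{\tfrac{7}{23}\left(1 - \tfrac{7}{23}\right)}{23} + \frac{\tfrac{5}{23}\left(1 - \tfrac{5}{23}\right)}{23} + \frac{\tfrac{8}{23}\left(1 - \tfrac{8}{23}\right)}{23} + \frac{\tfrac{19}{23}\left(1 - \tfrac{19}{23}\right)}{23}} \approx 0.181, \tag{18}$$

$$\mathrm{SE}(\hat{T}_{1|2,B}) = \sqrt{\mathrm{SE}(\hat{P}_1^*)^2 + \mathrm{SE}(\hat{P}_2^*)^2 + \mathrm{SE}(\hat{P}_3^*)^2 + \mathrm{SE}(\hat{P}_4^*)^2} = \sqrt{\frac{\tfrac{11}{23}\left(1 - \tfrac{11}{23}\right)}{23} + \frac{\tfrac{11}{23}\left(1 - \tfrac{11}{23}\right)}{23} + \frac{\tfrac{18}{23}\left(1 - \tfrac{18}{23}\right)}{23} + \frac{\tfrac{18}{23}\left(1 - \tfrac{18}{23}\right)}{23}} \approx 0.191. \tag{19}$$

So the 95% approximate confidence interval for $T_{2|1,A}$ is given by:

$$\hat{T}_{2|1,A} \pm 1.96 \times \mathrm{SE}(\hat{T}_{2|1,A}) \approx 0.565 \pm 1.96 \times 0.181 = [0.210, 0.920],$$

and the 95% approximate confidence interval for $T_{1|2,B}$ is given by:

$$\hat{T}_{1|2,B} \pm 1.96 \times \mathrm{SE}(\hat{T}_{1|2,B}) \approx 0 \pm 1.96 \times 0.191 = [-0.374, 0.374].$$

It follows that the effect of treatment 2 after treatment 1 on Twin A is significantly different from 0, but the effect of treatment 1 after treatment 2 on Twin B is not significantly different from 0.

4.3.4 Estimating the treatment effects in the absence of carry-over effects

In the last subsection, we allowed for the possibility that the treatment in the previous four-month time period has a carry-over effect on the boy in the later four-month time period. Therefore, when assessing the effects of treatments in the time period between assessment 3 and assessment 4, we chose our words carefully by using $T_{2|1,A}$ and $T_{1|2,B}$, which are the effect of treatment 2 after treatment 1 on Twin A and the effect of treatment 1 after treatment 2 on Twin B. We derived their natural estimators, and constructed the corresponding 95% approximate confidence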

intervals in the last subsection.

Now, if you are confident that there are no such carry-over effects, which means that the treatment applied in the period between assessment 2 and assessment 3 has no effect on the twins in the later period, then the effect of treatment 2 after treatment 1 on Twin A is just the effect of treatment 2 on Twin A, and the effect of treatment 1 after treatment 2 on Twin B is just the effect of treatment 1 on Twin B (see Figure 3). Let $T_{2,A}$ be the effect of treatment 2 on Twin A and $T_{1,B}$ be the effect of treatment 1 on Twin B. Then:

$$T_{2,A} = T_{2|1,A}, \qquad \hat{T}_{2,A} = \hat{T}_{2|1,A} = \frac{13}{23} \approx 0.565, \tag{20}$$

$$T_{1,B} = T_{1|2,B}, \qquad \hat{T}_{1,B} = \hat{T}_{1|2,B} = 0. \tag{21}$$

Since the expressions for $\hat{T}_{2,A}$ and $\hat{T}_{1,B}$ are the same as the expressions for $\hat{T}_{2|1,A}$ and $\hat{T}_{1|2,B}$, the 95% confidence intervals for $T_{2,A}$ and $T_{1,B}$ are the same as the 95% confidence intervals for $T_{2|1,A}$ and $T_{1|2,B}$, which we have already discussed in Section 4.3.3.

Figure 4 re-expresses Figure 3 under the assumption that there are no carry-over effects. As a result, we have 4 treatment effects: $T_{1,A}$, the effect of treatment 1 on Twin A; $T_{2,A}$, the effect of treatment 2 on Twin A; $T_{1,B}$, the effect of treatment 1 on Twin B; and $T_{2,B}$, the effect of treatment 2 on Twin B.

Figure 4: How the overall average probabilities change with no carry-over effect

4.3.5 One further step: average treatment effects

We have allowed the effects of treatment 1 on Twin A and on Twin B to be different, so we have $T_{1,A}$ and $T_{1,B}$ respectively. Similarly, we have $T_{2,A}$ and $T_{2,B}$ respectively. You might be interested in combining the information from Twin A and Twin B in order to get a general effect of treatment 1 and a general effect of treatment 2.

Let's first focus on treatment 1. A natural way of combining $T_{1,A}$ and $T_{1,B}$ is to average them. Let $T_1$ be the average effect of treatment 1; then:

$$T_1 = \frac{T_{1,A} + T_{1,B}}{2}. \tag{22}$$

From (7) and (21), we have $\hat{T}_{1,A} \approx 0.217$ and $\hat{T}_{1,B} = 0$, so the natural estimate of this general effect of treatment 1 is:

$$\hat{T}_1 = \frac{\hat{T}_{1,A} + \hat{T}_{1,B}}{2} \approx \frac{0.217 + 0}{2} \approx 0.109. \tag{23}$$

To test whether this average treatment effect is significantly different from 0, we can again use the confidence interval method. First, the standard error of $\hat{T}_1$ is given by:

$$\mathrm{SE}(\hat{T}_1) = \frac{1}{2}\sqrt{\mathrm{SE}(\hat{T}_{1,A})^2 + \mathrm{SE}(\hat{T}_{1,B})^2}. \tag{24}$$

From (16) and (19), we obtained $\mathrm{SE}(\hat{T}_{1,A}) \approx 0.221$ and $\mathrm{SE}(\hat{T}_{1,B}) \approx 0.191$. So:

$$\mathrm{SE}(\hat{T}_1) \approx \frac{1}{2}\sqrt{0.221^2 + 0.191^2} \approx 0.146, \tag{25}$$

which leads to the confidence interval for $T_1$:

$$\hat{T}_1 \pm 1.96 \times \mathrm{SE}(\hat{T}_1) \approx 0.109 \pm 1.96 \times 0.146 \approx [-0.177, 0.395].$$

Since the 95% approximate confidence interval for $T_1$ contains 0, the average effect of treatment 1 is not significantly different from 0.

Similarly, from (12) and (20), the estimate of the average effect of treatment 2 is given by:

$$\hat{T}_2 = \frac{\hat{T}_{2,A} + \hat{T}_{2,B}}{2} \approx \frac{0.565 + 0.304}{2} \approx 0.435. \tag{26}$$

From (17) and (18), the standard error of $\hat{T}_2$ is given by:

$$\mathrm{SE}(\hat{T}_2) = \frac{1}{2}\sqrt{\mathrm{SE}(\hat{T}_{2,A})^2 + \mathrm{SE}(\hat{T}_{2,B})^2} \approx \frac{1}{2}\sqrt{0.181^2 + 0.248^2} \approx 0.154, \tag{27}$$

which leads to the 95% approximate confidence interval for $T_2$:

$$\hat{T}_2 \pm 1.96 \times SE(\hat{T}_2) \approx 0.435 \pm 1.96 \times 0.154 \approx [0.133,\ 0.737].$$

Since the 95% approximate confidence interval for $T_2$ doesn't contain 0, the average effect of treatment 2 is significantly different from 0. More specifically, this confidence interval contains only positive values, indicating that treatment 2 has a beneficial effect, as is suggested in Figure 1.

4.3.6 Finally: the difference of the average treatment effects

You might also be interested in comparing the effect of treatment 1 to the effect of treatment 2. From the above analysis, we know that the average effect of treatment 1 is not significantly different from 0, but the average effect of treatment 2 is significantly different from 0. Can we then claim that the average effect of treatment 2 is significantly higher than the average effect of treatment 1? This conclusion does not logically follow, as each of those confidence intervals simply compares the corresponding average effect to 0. So, for example, even if the average effect of treatment 2 were only a small amount larger than the average effect of treatment 1, such confidence intervals could still be obtained.

In order to assess the difference between the two average effects, we need to construct the 95% approximate confidence interval for $T_2 - T_1$. From (23) and (26), we know that:

$$\hat{T}_2 - \hat{T}_1 = \frac{\hat{T}_{2,A} + \hat{T}_{2,B}}{2} - \frac{\hat{T}_{1,A} + \hat{T}_{1,B}}{2}.$$

From previous sections, we know that $\hat{T}_{1,A}$, $\hat{T}_{1,B}$, $\hat{T}_{2,A}$ and $\hat{T}_{2,B}$ can be expressed as linear combinations of the $\hat{P}_i$'s and the $\hat{P}'_i$'s, so $\hat{T}_2 - \hat{T}_1$ can also be expressed as a linear combination of the $\hat{P}_i$'s and the $\hat{P}'_i$'s. In fact, after some simplification, we get:

$$\hat{T}_2 - \hat{T}_1 = \frac{1}{2}\left(\hat{P}_4 - 2\hat{P}_3 + \hat{P}_2\right) - \frac{1}{2}\left(\hat{P}'_4 - 2\hat{P}'_3 + \hat{P}'_2\right). \tag{28}$$

Substituting in the values of the $\hat{P}_i$'s and the $\hat{P}'_i$'s, we have $\hat{T}_2 - \hat{T}_1 \approx 0.326$.
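As a numerical check on the averages of Section 4.3.5 and on the difference just obtained, the arithmetic can be sketched in Python. The function name `average_effect_ci` is ours and purely illustrative; the inputs are the rounded estimates and standard errors quoted in this report, and the standard error formula assumes, as in (24), that the two per-twin estimates are independent.

```python
import math

def average_effect_ci(estimates, std_errors, z=1.96):
    """Average per-twin effect estimates; return (estimate, se, (low, high)).

    Assumes the per-twin estimates are independent, as in equation (24).
    """
    n = len(estimates)
    est = sum(estimates) / n
    se = math.sqrt(sum(s ** 2 for s in std_errors)) / n
    return est, se, (est - z * se, est + z * se)

# Rounded values quoted in the report (treatment 1, then treatment 2).
t1, se1, ci1 = average_effect_ci([0.217, 0.0], [0.221, 0.191])
t2, se2, ci2 = average_effect_ci([0.304, 0.565], [0.181, 0.248])

for name, est, se, (low, high) in [("T1", t1, se1, ci1), ("T2", t2, se2, ci2)]:
    verdict = "excludes 0" if low > 0 or high < 0 else "contains 0"
    print(f"{name}: estimate {est:.3f}, SE {se:.3f}, "
          f"CI [{low:.3f}, {high:.3f}] ({verdict})")

# Point estimate of the difference between the two average effects.
print(f"T2 - T1: {t2 - t1:.3f}")
```

Because the inputs are rounded to three decimals, the interval endpoints can differ from the reported ones in the last digit; the conclusions about containing 0 are unaffected.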

From (15), we can calculate the standard error of $\hat{T}_2 - \hat{T}_1$ as:

$$SE(\hat{T}_2 - \hat{T}_1) = \frac{1}{2}\sqrt{SE(\hat{P}_4)^2 + 4\,SE(\hat{P}_3)^2 + SE(\hat{P}_2)^2 + SE(\hat{P}'_4)^2 + 4\,SE(\hat{P}'_3)^2 + SE(\hat{P}'_2)^2} \approx 0.159, \tag{29}$$

which leads to the 95% approximate confidence interval for $T_2 - T_1$:

$$(\hat{T}_2 - \hat{T}_1) \pm 1.96 \times SE(\hat{T}_2 - \hat{T}_1) \approx 0.326 \pm 1.96 \times 0.159 \approx [0.014,\ 0.638].$$

The 95% approximate confidence interval for $T_2 - T_1$ doesn't contain 0, so the two average treatment effects are significantly different. Since this confidence interval contains only positive values, you can conclude that treatment 2 has a better effect than treatment 1 on these two boys.

5 Discussion

Potential effect of the person who gives the treatment

In your experiment, treatment 1 was given by one teacher and treatment 2 was given by another teacher. Since different teachers may have different teaching styles, they may affect the learning behaviors of the children. If the teacher who gave treatment 1 had given treatment 2 and the teacher who gave treatment 2 had given treatment 1, then the result of every assessment might have been different. So, strictly speaking, every treatment effect we have discussed should be referred to as the effect of the treatment combined with the effect of the teacher. However, in your experiment, we can't separate the effect of the treatment from the effect of the teacher. In order to separate them, you would need to have the same treatment given by different teachers.

About the confidence interval

When we construct a 95% confidence interval, we use the form $\text{estimate} \pm 1.96 \times SE(\text{estimate})$. Why did we use the number 1.96? The number 1.96 is associated with the 95% coverage probability of the confidence interval. For example, if the 95% confidence interval for $T_1$ doesn't contain 0, then we claim the average effect of treatment 1 is significantly different from 0. However, such a claim is not certain: intervals built this way cover the true value about 95% of the time, so when the true effect is 0 there is still about a 5% chance of wrongly declaring it different from 0. You could increase the number 1.96 to obtain a higher coverage probability, or decrease the number 1.96 to obtain a lower coverage probability.
We used 1.96 because 95% coverage probability is the most common choice in most subject areas.
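To make the link between the multiplier and the coverage probability concrete, here is a minimal Python sketch based on the standard normal distribution. The helper names `z_multiplier` and `normal_ci` are ours, and the inputs are the estimate and standard error of the difference between the two average treatment effects (0.435 − 0.109 = 0.326, with standard error 0.159).

```python
from statistics import NormalDist

def z_multiplier(coverage):
    """Two-sided standard normal multiplier for a given coverage probability."""
    return NormalDist().inv_cdf(0.5 + coverage / 2)

def normal_ci(estimate, se, coverage=0.95):
    """Approximate confidence interval: estimate +/- multiplier * SE."""
    z = z_multiplier(coverage)
    return estimate - z * se, estimate + z * se

# Difference of the average treatment effects and its standard error.
estimate, se = 0.326, 0.159

for coverage in (0.90, 0.95, 0.99):
    low, high = normal_ci(estimate, se, coverage)
    print(f"{coverage:.0%} coverage: multiplier {z_multiplier(coverage):.3f}, "
          f"CI [{low:.3f}, {high:.3f}]")
```

With these inputs the 95% interval lies just above 0, while the wider 99% interval includes 0, so the conclusion about the difference depends on the coverage probability chosen.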

About the standard error

All the standard errors we mentioned are computed from the standard errors of the $\hat{P}_i$'s, the $\hat{P}'_i$'s, or both. So the basic building blocks of all the standard errors are the standard errors of the $\hat{P}_i$ or the $\hat{P}'_i$. From Section 4.2, we know:

$$SE(\hat{P}_i) = \sqrt{\frac{\hat{P}_i\,(1 - \hat{P}_i)}{23}}.$$

You can see that the denominator in the above standard error is 23, which is the number of test items in each assessment. In fact, the denominator of the above standard error will always be the number of test items, no matter how many test items there are. If the number of test items is 50, then the denominator is 50; if the number of test items is 3, then the denominator is 3. If $N$ is the number of test items, then:

$$SE(\hat{P}_i) = \sqrt{\frac{\hat{P}_i\,(1 - \hat{P}_i)}{N}}.$$

Thus, the larger $N$ is, the smaller $SE(\hat{P}_i)$ is. This is also true for $SE(\hat{P}'_i)$. The standard error of any estimator we mentioned is based on $SE(\hat{P}_i)$ and $SE(\hat{P}'_i)$, so the larger $N$ is, the smaller the standard error of that estimator is. A smaller standard error means that the estimator is more precise.

In our previous discussion, you can see that some treatment effects, and some changes seen between two assessments, are not statistically significant. One reason is that the standard error of the estimator of that effect or change is large because $N$ is small: we can't precisely estimate that effect or change with relatively little information. So in future studies, provided this is practical and all the assumptions made would continue to hold, you could get more precise estimators by increasing $N$, the number of test items.
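The effect of the number of test items on precision can be illustrated with a short Python sketch. The function name `se_proportion`, the proportion 0.5 and the item counts below are hypothetical values chosen only for illustration; they are not data from the study.

```python
import math

def se_proportion(p_hat, n_items):
    """Standard error of an estimated proportion based on n_items test items."""
    return math.sqrt(p_hat * (1 - p_hat) / n_items)

p_hat = 0.5  # hypothetical estimated probability of a correct pronunciation

for n_items in (25, 50, 100, 400):  # hypothetical numbers of test items
    se = se_proportion(p_hat, n_items)
    print(f"N = {n_items:3d}: SE = {se:.3f}, 95% half-width = {1.96 * se:.3f}")
```

Because the standard error shrinks with the square root of N, the number of test items has to be quadrupled in order to halve the standard error.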