Impact Evaluation of Mindspark Centres


Impact Evaluation of Mindspark Centres
March 27th, 2014

Executive Summary

About Educational Initiatives and Mindspark
Educational Initiatives (EI) is a prominent education organization in India with the mission of ensuring that every child learns with understanding. One of EI's initiatives is an adaptive learning program called Mindspark, which teaches primary school students mathematics and language according to their current level of understanding. Mindspark has been developed through student usage in classrooms over the past 5 years and is currently used in approximately 150 government and private schools in India and the Middle East. Additionally, EI has experimented with providing an adapted version of the Mindspark platform to low-income students in South Delhi through the establishment of five fee-for-service centres.

About the Evaluation
This evaluation builds on previous retrospective impact evaluation work that IDinsight has done for EI focused on the use of Mindspark in schools. 1 This evaluation, as compared to the previous one, focuses on the use of an adapted Hindi version of the Mindspark platform geared towards students in low-income communities in South Delhi. IDinsight, a client-focused impact evaluation firm, designed this follow-on evaluation to answer these questions about the impact of Mindspark Centre attendance:

1. Primary: What is the impact of using Mindspark on student learning outcomes in mathematics and language?
2. Secondary: Is there a relationship between the intensity of Mindspark usage and learning gains?

This evaluation matched students in Mindspark centres who used the platform extensively with those who did not, pairing students of similar age, gender, school type (public vs. private), learning level at baseline, and exposure to Mindspark before the baseline.
Impact was measured using a difference-in-differences regression approach comparing student performance on quarterly tests administered by EI between the high-treatment and low-treatment groups.

Key Findings
This evaluation finds indicative, but not statistically significant, results of impact from the Mindspark platform on both language and math test scores, with slightly larger gains in language scores than in math. Given the short time frame and other constraints of this analysis, the results from this study are encouraging. With a larger sample, it is possible that these results would be statistically significant.

Recommendations
IDinsight believes a randomized controlled trial (RCT) is best suited to confirm these early, positive results. Given its cost and time commitment, IDinsight recommends that an RCT be pursued only after the operational model for low-cost Mindspark use has been determined.

1 This report was drafted by Andrew Fraker, IDinsight, in April 2014. Language in this document is based on, and in some cases directly copied from, that report.

Table of Contents
Executive Summary
  About Educational Initiatives and Mindspark
  About the Evaluation
  Key Findings
  Recommendations
Evaluation Design
  About the Intervention: Mindspark
  Evaluation Questions
  Evaluation Method
  Potential Limitations of the Evaluation
  Test Instrument and Outcome Variables
  Data Description
  Sample Frame
  Data Analysis
  Matching Results
Results
  Primary Results
  Exploring Usage
Recommendation
Appendices
  Math Treatment Effect Estimates for 25 Matching Specifications
  Language Treatment Effect Estimates for 25 Matching Specifications
  Matching Quality Tables: Math
  Matching Quality Tables: Language
  Full Matching Results: Math
  Full Matching Results: Language
  Full Regression Table: Math
  Full Regression Table: Language
  About IDinsight

Evaluation Design

About the Intervention: Mindspark 2
Mindspark Centres offer a hybrid model of education, combining in-person tutoring with use of the Mindspark adaptive learning software. Enrolment in the centres includes:
- Daily 90-minute sessions split between platform usage and in-person instruction
- Platform usage alternating between 45 minutes of Hindi and 45 minutes of Math instruction
- In-person instruction consisting of small-group instruction, homework support, or exam support

The students attending Mindspark centres come from government and low-income private schools in South Delhi. These schools typically charge Rs. 400 to 500 per month in fees. Government-school students make up 60% of total Mindspark centre enrolment. Compared to IDinsight's previous evaluation, these students come from significantly poorer economic backgrounds.

Evaluation Questions
1. Primary: What is the impact of using Mindspark on student learning outcomes in maths and language?
2. Secondary: Is there a relationship between the intensity of Mindspark usage and learning gains?

Evaluation Method
The evaluation matched students with similar characteristics who used the platform extensively (14 to 17 hours on average per subject) during a three-month period with those who did not use it extensively (less than 7 hours per subject). This resulted in a treatment group with approximately twice as much time on the platform as the control group. The three-month period was chosen so that quarterly tests administered by EI could be used as baseline and endline tests. Propensity score matching was used to pair children in the high-treatment group with children who are as identical as possible in the low-treatment group. Matching for the Math and Language analyses was conducted separately. The table below lists the variables considered for matching and whether they were included in the final matching specification.
Variables were included in the matching specification based on their imbalance between treatment and control groups as well as their expected importance in influencing the outcome variable.

Table 1: Matching Variables (Variable | Included in primary specification? | Description)
Child Class | Yes | Grade in school at time of enrolment
School Type | Yes | Government vs. private school
Gender | Yes | Student's gender

2 The language used in this section is largely based on Mindspark's website: http://centres.mindspark.in/about-mindspark-centres.php

Baseline Learning Level | Yes | Learning level, as measured by the Mindspark platform at time of baseline 3
Platform Usage Pre-Baseline | Yes | Amount of time (measured in minutes) that a student spent on the Mindspark platform pre-baseline. Quarterly tests are administered on consistent dates, rather than when a student enters Mindspark.
Mindspark Centre | Yes (Math); No (Language, dropped due to a large reduction in sample size and controlled for statistically instead) | Mindspark centre attended by the student
Tuition | No (controlled for statistically) | Whether the student attends after-school tuition sessions
Season | No (controlled for statistically) | Time of year the test was taken

Of potential matching techniques, we employed propensity score matching because it works better than the main alternative, coarsened exact matching, when working with smaller data sets. The matching process created a comparison group with similar means and distributions on key variables and passed all standard tests used to evaluate the quality of propensity score matches.

Potential Limitations of the Evaluation
Ideally a randomized controlled trial (RCT) would have been used to evaluate this program. This approach was attempted by the research team in November 2014, but was not possible due to sample size constraints. The comparative advantage of an RCT over a matching design is that the former helps achieve balance on unobservable or hard-to-measure characteristics such as student or teacher motivation. The methodology employed in this evaluation (matching with difference-in-differences analysis using EI's own data) controls for observable characteristics that are in the dataset, but leaves open the possibility of systematic differences in unobserved characteristics and in observable characteristics not captured in the dataset.
The key potential source of bias for this matching study is the set of unobserved characteristics that may have determined why some students used the platform more than others, aside from the variation explained by covariates. For example, it is possible that highly motivated students formed the high-usage (treatment) group compared to the low-usage (control) group. These highly motivated students, even in the absence of the Mindspark program, may have improved their learning at a faster pace. The analysis is unable to fully correct for this potential bias. 4

Another potential limitation of the study is the lack of a pure control group. The control group for this analysis comprises students who did not use the platform intensively, rather than students who did not use the platform at all. IDinsight's previous analysis of Mindspark (in high-cost schools) suggests that there is a period of low usage in which the Mindspark platform has little to no effect on student outcomes. This mitigates, but does not remove, this potential bias.

The final potential limitation of this study is the relatively short treatment period. Due to limits on an available control group with low platform usage, the study sample was restricted to a three-month period. IDinsight's previous study of the Mindspark platform (in high-cost schools) found an impact when measured after use for an entire year. Some of this concern is mitigated by focusing on students with high usage (higher than would occur at school), but overall it cannot be discounted.

3 Note: this baseline learning level refers to the final learning level determined by a series of 1 to 3 tests in the student's first round of quarterly testing at the Mindspark centre.
4 Difference-in-differences analysis controls for unobservable characteristics that do not change over time by examining the difference between baseline and endline scores. It does not control for unobservable variables that could fluctuate over time.

Test Instrument and Outcome Variables
Quarterly tests administered by EI on the Mindspark platform served as the test instrument for this study. Since their inception, Mindspark Centres have tested all enrolled students every three months with two separate tests, math and language. For each student, the computer randomly draws a set of questions from Mindspark's overall bank of questions. The questions are selected to test key learning concepts that the student should learn at their presently measured learning level. No two students receive the exact same test. In cases where students do exceedingly well or poorly on their initial test, a second test is given at the grade level above or below. Based on these test results, the student's estimated learning level may be changed. Given that no two tests are identical, the analysis for this report controls for the difficulty of each question that the child faces. The difficulty of each question is calculated by measuring the rate at which it is answered correctly when attempted during regular use of the platform by all students (non-quarterly-test attempts) and on quarterly tests by students who are not in the sample frame.
All questions that did not have a sufficient number of attempts to measure question difficulty were dropped, resulting in roughly 10% of all test questions being removed from consideration. 5 The final outcome variable from the quarterly testing data was structured as a binary variable for each question. 6

Data Description
Educational Initiatives (EI) provided all data to IDinsight.

User Characteristics Data
User characteristics data on Mindspark students was gathered from Mindspark's centre enrolment processes. In cases where data was missing or uncategorized, the research team worked with Mindspark staff to code and clean the data. 7

Platform Usage Data
Platform usage was measured by taking the sum of all individual session times recorded on the Mindspark platform. Checks for correlation were performed with the number of total questions answered during these sessions.

5 Note: to achieve this 10% cutoff, different thresholds were used for the math and language datasets. A minimum of 100 attempts was required for the Math dataset and a minimum of 50 attempts for the Language dataset. These adjustments were made to strike a balance between accuracy of the difficulty measurement and loss in sample size.
6 Many language questions contained multiple sub-questions within a given question. If more than half of the sub-questions were answered correctly, as per standard EI grading processes, the question was marked correct.
7 This included school type and class/age data provided by EI.
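The question-difficulty calculation described above can be sketched as follows. This is an illustration, not EI's actual code: the attempt log is invented and the attempt threshold is far smaller than the report's actual minimums of 100 (Math) and 50 (Language).

```python
from collections import defaultdict

# Hypothetical attempt log: (question_id, answered_correctly) pairs drawn from
# regular platform use and from quarterly tests by out-of-sample students.
attempts = [
    ("q1", True), ("q1", False), ("q1", True), ("q1", True),
    ("q2", True), ("q2", True),
]

MIN_ATTEMPTS = 3  # illustrative; the report used 100 (Math) and 50 (Language)

def question_difficulty(attempts, min_attempts):
    """Difficulty = share of attempts answered correctly; questions with
    too few attempts are dropped, as in the report."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for qid, ok in attempts:
        total[qid] += 1
        correct[qid] += ok
    return {q: correct[q] / total[q] for q in total if total[q] >= min_attempts}

# q1 has 4 attempts (3 correct); q2 has only 2 attempts and is dropped.
print(question_difficulty(attempts, MIN_ATTEMPTS))  # → {'q1': 0.75}
```

Dropped questions are excluded from the outcome data entirely, which is what produces the "at least ten eligible questions" sample-frame restriction described below.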

Sample Frame
The sample frame for this evaluation was constructed from the overall pool of Mindspark students to create treatment and control groups as similar as possible. Students who met the following restrictions were added to the sample frame:
- Students who completed at least two consecutive quarterly tests with Mindspark after their initial enrolment at the centres.
- Students whose baseline test was given at the same learning level as the endline test. In these cases, this learning level will also match the learning level of content taught during the intervention period.
- Students who accessed Mindspark during the regular operations of the centres. This excludes students who used Mindspark through one-off partnerships with schools.
- Students with at least ten eligible questions on both the baseline (first) and endline (second) tests. Non-eligible questions are those without enough data external to the evaluation to judge question difficulty, as described above.
- Students who were not missing key covariate data (school type, tuitions, etc.).

The final sample frame contained 735 students for math and 788 students for language.

Data Analysis

Overview
The research team created a pre-analysis plan specifying matching techniques and regression specifications prior to analysis. During sensitivity analysis, it was initially observed that adjusting the sample frame sizes slightly led to very different weighting for certain observations. To reduce dependence on a single, highly weighted observation, the research team ran 25 different matching specifications (using five small adjustments of the control group's size and five small adjustments of the treatment group's size) and reports the median effect. The results of all 25 matching specifications can be found in the appendix.
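The median-over-25-specifications step can be sketched as follows. Everything here is hypothetical: `run_specification` is a stand-in for the full match-then-regress pipeline, and its deterministic formula is invented purely to make the sketch runnable.

```python
import statistics

# Five small adjustments of each group's size give a 5 x 5 grid of
# specifications (sizes here are the ones used for the Math analysis grid).
treat_sizes = [240, 245, 250, 255, 260]
control_sizes = [340, 345, 350, 355, 360]

def run_specification(n_treat, n_control):
    # Placeholder for match-then-regress on a sample of the given sizes;
    # this toy formula merely returns a made-up odds-ratio estimate.
    return 1.10 + (n_treat - 240) * 0.001 + (n_control - 340) * 0.0005

estimates = [run_specification(t, c) for t in treat_sizes for c in control_sizes]

# The reported treatment effect is the median of the 25 estimates.
print(round(statistics.median(estimates), 3))  # → 1.115
```

Reporting the median rather than a single specification's estimate is what protects the headline result from any one highly weighted observation.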
Regression Specifications
Following the matching process, the main data analysis technique is difference-in-differences 8 with the following specification:

Correct = β0 + β1 treatment*endline + β2 treatment + β3 endline + β4 difficulty + β5 centre + β6 school_type + β7 gender + β8 child_class + β9 minutes_pre + β10 learning_level + β11 season + β12 tuition + ε

Note: this analysis is clustered at the student level and uses frequency weights for control observations based on the results of the matching process. For this analysis, the data is arranged long, where a dummy variable differentiates between baseline and endline observations.

Where variables are defined as:
treatment*endline: coefficient of impact estimate; effect of treatment
treatment: treatment group binary variable
endline: baseline/endline binary variable

8 Note that instead of subtracting baseline test score values, this specification statistically controls for baseline levels, which maximizes statistical power.

difficulty: calculated difficulty of each question, based on the average rate at which students answer the question correctly
centre: categorical variable representing the 5 different Mindspark centres
school_type: government vs. private school binary variable
gender: gender binary variable
child_class: child's class at time of enrolment
minutes_pre: minutes spent on the platform before the baseline test
learning_level: learning level in the subject at end of baseline, beginning of endline, and the intervening period (by definition of the sample frame, all three of these values are equal)
season: quarter in which the test was taken
tuition: binary variable indicating whether the student is enrolled in tuitions
ε: error term

Standard errors were clustered at the student level. Matching weights calculated from the propensity scores were employed.

Matching Results
The following graphs demonstrate how well the matching process worked in creating two nearly identical groups. As a reminder, for the main impact measure, the median result was taken from 25 different matching specifications. These results are indicative 9 of the remaining 24 specifications not shown (available in the Appendix). Each graph shows the percent bias difference between treatment and control pre-matching (dots) and post-matching (x's). Before matching, treatment and control groups differed greatly on important dimensions such as time spent on the platform pre-baseline and learning level at baseline. The matching process for both the Mathematics and Language samples created two groups that were as close to identical as possible for final analysis.

Graph 1: Matching Results, Language (Left) and Math (Right) 10

9 Note: the sample frame size combination with the median impact in Math represented one of the poorer matches of the 25 tested.
While these matches are still a drastic improvement over pre-matched combinations, 23 of the 24 other matching configurations outperformed this instance on all standard matching metrics (chi-square, Rubin's B, Rubin's R).

10 Note: the centre variable was omitted from the Language matching process because it drastically reduced the number and quality of matches.
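The nearest-neighbour matching step behind these results, and the control-observation frequency weights it produces for the regression, can be illustrated with a minimal sketch. The propensity scores below are invented; in the actual analysis they would come from a model of high-usage status on the matching covariates in Table 1.

```python
from collections import Counter

def nearest_neighbour_match(treated_scores, control_scores):
    """Match each treated unit to the control unit with the closest
    propensity score (with replacement). The returned counts are the
    frequency weights applied to control observations downstream."""
    weights = Counter()
    for ps in treated_scores:
        # index of the control unit with the smallest score distance
        j = min(range(len(control_scores)),
                key=lambda k: abs(control_scores[k] - ps))
        weights[j] += 1
    return weights

# Hypothetical propensity scores for 4 treated and 4 control students.
treated = [0.61, 0.55, 0.72, 0.44]
control = [0.30, 0.56, 0.60, 0.74]

print(nearest_neighbour_match(treated, control))
```

A control student matched to two treated students receives weight 2, one never matched receives weight 0; this is how a smaller control pool can still mirror the treatment group's covariate distribution.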

In order to conduct a meaningful comparison between these two matched groups, the treatment and control groups were required to have a substantial difference in platform usage between baseline and endline testing. The table below shows platform usage, in hours, by treatment/control group and by time period (time spent on the platform pre-baseline and during the intervention period).

Table 2: Platform Usage Pre- and Post-Baseline by Subject (mean hours, standard deviations in parentheses)

Math
Group | N | Pre-Baseline | Intervention Period | Total
Control | 166 | 4.77 (2.96) | 6.98 (1.82) | 11.75 (3.58)
Treatment | 237 | 5.21 (3.85) | 14.31 (2.29) | 19.52 (4.71)

Language
Group | N | Pre-Baseline | Intervention Period | Total
Control | 172 | 6.62 (4.59) | 6.96 (1.79) | 13.58 (5.01)
Treatment | 232 | 6.81 (4.97) | 17.98 (1.89) | 24.79 (5.60)

Results

Primary Results
For both subjects, Mindspark had a positive, although not statistically significant, effect on students' test scores. Both results are indicative of impact but not conclusive, with larger treatment estimates for language test results than for math test results. With a larger sample size, it is possible that these results would be statistically significant at the given estimates.

For math questions, students who used Mindspark more intensely (the treatment group) were an estimated 14% more likely to answer a given question correctly (odds ratio 1.14; 95% CI: 0.94 to 1.37). 11 The p-value for this estimate is 0.19. For language questions, students who used Mindspark more intensely were an estimated 22% more likely to answer a given question correctly (odds ratio 1.22; 95% CI: 0.92 to 1.62). The p-value for this estimate is 0.16. When results are not statistically significant, as here, there is a reasonable chance that the positive estimate is due to chance rather than true impact.
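As a consistency check, a reported p-value can be approximately recovered from the odds ratio and its 95% confidence interval, assuming normality on the log-odds scale. This back-calculation is not part of the report's analysis, and because the printed OR and CI endpoints are rounded, it lands near, not exactly at, the reported value.

```python
import math

def p_value_from_or_ci(or_point, ci_low, ci_high):
    """Approximate two-sided p-value from an odds ratio and its 95% CI,
    assuming normality on the log-odds scale."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)
    z = math.log(or_point) / se
    # two-sided p-value via the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Math result: OR 1.14, 95% CI [0.94, 1.37].
p = p_value_from_or_ci(1.14, 0.94, 1.37)
print(round(p, 2))  # close to, though not exactly, the reported 0.19
```

The same check applied to the language result (OR 1.22, CI 0.92 to 1.62) similarly lands near its reported p-value of 0.16.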
For the Math results, 4 of the 25 tested matching specifications produced statistically significant estimates, providing some evidence that a statistically significant impact may emerge with a larger sample size or a longer exposure period on the platform. The treatment effect estimates are large considering the short time frame of the intervention.

Another way to express these results is as the change in the fraction of questions answered correctly. 12 For math questions, students who used Mindspark more intensely (the treatment group) scored 2.9 percentage points better than the low-usage group. This corresponds to a 0.06 standard-deviation effect size. 13 For language, the high-usage group saw a 6.5 percentage point increase in their scores due to the platform, which corresponds to a 0.13 standard-deviation increase in test scores. Because the odds ratio coefficients are not statistically significant, we cannot say that these percentage point changes are statistically significant either. The language results, and the math results if confirmed by further study, would be considered successful education interventions, particularly given the short time frame.

11 Note this is 14%, not 14 percentage points.
12 This entails taking the derivative of the logistic function with respect to the treatment effect variable. In a linear regression, the estimated coefficients are the derivatives, because the derivative of a linear function is just the (constant) slope. For logistic regression, however, the point estimate and derivative differ because of the nonlinearity of the logistic function. Hence, we can express the outcomes as either a change in likelihood due to Mindspark or as a percentage point change.
13 Note that because this is a matching study with binary outcomes, there is a choice of which standard deviation to use. We have used the standard deviation of the comparison group at baseline.

Robustness of the Findings
Additional analyses were employed to check the robustness of the findings. As discussed previously, 25 different matching specifications were employed for both analyses. This allowed the research team to observe the spread of potential treatment effect sizes for both subjects. The different matching specifications yielded different treatment effects, but these were in all cases positive (full results in the appendix). This gives the research team confidence that there is likely a positive effect for both platform subjects, even if it cannot be precisely estimated with available data.

Additionally, graphical evidence supports this view. The graphs below 14 compare treatment vs. control at baseline and endline using the fitted values from the regression analysis. For each observation (a question asked of a specific student), the probability of the student answering the question correctly based on the full regression model is modelled and displayed in the histogram. In both cases, the groups appear largely balanced at baseline. In both the math and language cases, there is clearly a greater rightward shift for treatment students than for control students. There is also a noticeable rightward shift for the control group between baseline and endline, although smaller than that of the treatment group. This could be attributable to the time control students spent on the platform, or in school, or a combination of the two.

Graph 2: Math and Language Probability Estimates for Baseline vs. Endline

14 The full regression specifications from which these graphs were created are included in the appendix.
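The conversion described in footnote 12 can be sketched numerically: the derivative of the logistic function implies a percentage-point effect of roughly β·p·(1−p), where β is the log of the odds-ratio coefficient and p a baseline probability of a correct answer. The 50% baseline used below is illustrative, not the report's actual baseline correctness rate.

```python
import math

def logistic_marginal_effect(odds_ratio, p_base):
    """Approximate change in probability implied by an odds-ratio
    coefficient, via the derivative of the logistic function at p_base."""
    beta = math.log(odds_ratio)          # log-odds coefficient
    return beta * p_base * (1 - p_base)  # d p / d beta, evaluated at p_base

# Illustrative: OR 1.14 evaluated at a 50% baseline correctness rate
print(round(logistic_marginal_effect(1.14, 0.5), 3))  # → 0.033
```

At a 50% baseline this gives roughly 3.3 percentage points, in the same neighbourhood as the report's 2.9-point figure, which was computed at the sample's actual fitted probabilities.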

One concern with education studies that do not use randomized controlled trials is that there might be school or teacher effects; that is, the schools and teachers that students in the treatment group attended might have already been on a faster learning trajectory, even without Mindspark. Matching was unable to control for this possibility due to a lack of school- or teacher-level data. The treatment effect estimates are not correlated with the quality of the matches.

Exploring Usage
As a secondary analysis, the research team examined the relationship between platform usage and the percent of questions correctly answered. It is important to note that these results are not causal: we are unable to construct a control group for this analysis. This analysis also does not control for other covariates, such as school type or tuition attendance, which may also influence a student's performance. Overall, there is a positive relationship between usage and answer correctness, with some evidence of diminishing returns to time spent on Mindspark. It should be noted that for both language and mathematics there is little data at either end of the usage distribution (zero usage and 20+ hours). As such, the shape of the curves at these extremes should be treated as less reliable, as demonstrated by the widening of the confidence intervals there. The two graphs below show these relationships visually, with usage vs. correctness in the top graph and the distribution of usage data in the bottom graph. Please note that this analysis pools baseline and endline tests, treatment and control students, and also includes students with medium amounts of usage that were cut out of the main analysis to construct a better experiment.
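A crude version of this usage-correctness analysis can be sketched with binned means. The report's graphs use a smoothed fit with confidence bands; the records below are invented student-test observations.

```python
from collections import defaultdict

# Hypothetical (hours_on_platform, fraction_correct) pairs, one per student-test.
records = [(2, 0.40), (4, 0.45), (6, 0.50), (9, 0.55),
           (11, 0.58), (14, 0.60), (16, 0.59)]

def binned_means(records, bin_width=5):
    """Mean correctness within usage bins of bin_width hours,
    keyed by each bin's lower edge."""
    sums = defaultdict(lambda: [0.0, 0])
    for hours, correct in records:
        b = int(hours // bin_width) * bin_width  # bin lower edge
        sums[b][0] += correct
        sums[b][1] += 1
    return {b: round(s / n, 3) for b, (s, n) in sorted(sums.items())}

print(binned_means(records))  # → {0: 0.425, 5: 0.525, 10: 0.59, 15: 0.59}
```

The flattening between the 10-hour and 15-hour bins in this toy output mirrors the diminishing returns pattern described above; with so few observations per bin, the extreme bins are exactly where estimates are least reliable.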

Graph 3: Test Scores vs. Platform Usage Time by Subject

Recommendation

We believe the findings in this report are indicative, but not conclusive, of impact by the Math portion of the platform on learning outcomes. The results from the language testing data are particularly promising, given the large estimated treatment effect in a short period of time. IDinsight stops short of calling these results conclusive due to the inability of the matching study to control for potential forms of bias. To validate these indicative results, a randomized controlled trial would be useful. However, IDinsight encourages EI to focus in the short term on finalizing its delivery model for the Mindspark platform, whether it continues to operate in Mindspark Centres or expands into low-cost and government schools in Delhi and other regions. Once this operational model is finalized, IDinsight believes a randomized controlled trial can best evaluate the effect of the Mindspark platform in the setting in which it would scale. This would allow more precise measurement of the impact of this highly promising learning tool for vulnerable children across India.

Appendices

For each sample-frame size combination, the odds ratio and the associated p-value are included. The odds ratio tells the reader the relative likelihood of a binary outcome being equal to 1 as a single covariate or treatment variable changes. In this case, the odds ratio for the treatment effect shows the relative likelihood of a student in the treatment group getting a single question correct compared to a student in the control group. An odds ratio greater than 1 means a student in the treatment group is more likely to answer a question correctly than a student in the control group, all else equal. An odds ratio of 1.09 can be interpreted as a 9 percent (not percentage point) increase in the likelihood of a correct response.

Math Treatment Effect Estimates for 25 Matching Specifications

Four of the 25 specifications are significant at the p < 0.10 level, one of which is also significant at the p < 0.05 level. Rows give the number of treatment observations; columns give the number of control observations. Cells report the odds ratio with the p-value in parentheses.

             340            345            350            355            360
240          1.14 (0.16)    1.10 (0.31)    1.16 (0.10)*   1.14 (0.15)    1.11 (0.26)
245          1.12 (0.22)    1.11 (0.28)    1.14 (0.14)    1.14 (0.19)    1.12 (0.20)
250          1.14 (0.17)    1.13 (0.19)    1.12 (0.21)    1.12 (0.23)    1.02 (0.81)
255          1.13 (0.18)    1.15 (0.12)    1.16 (0.10)    1.11 (0.28)    1.08 (0.38)
260          1.15 (0.13)    1.17 (0.09)*   1.14 (0.16)    1.16 (0.10)*   1.20 (0.03)**

Language Treatment Effect Estimates for 25 Matching Specifications

None of the 25 specifications are significant at the p < 0.10 level. Rows give the number of treatment observations; columns give the number of control observations.

             390            395            400            405            410
240          1.25 (0.13)    1.28 (0.11)    1.24 (0.13)    1.25 (0.13)    1.26 (0.11)
245          1.27 (0.11)    1.28 (0.11)    1.23 (0.15)    1.23 (0.15)    1.23 (0.15)
250          1.22 (0.16)    1.19 (0.23)    1.21 (0.19)    1.25 (0.12)    1.19 (0.21)
255          1.19 (0.23)    1.16 (0.30)    1.21 (0.19)    1.22 (0.17)    1.26 (0.13)
260          1.19 (0.21)    1.13 (0.38)    1.16 (0.28)    1.22 (0.14)    1.22 (0.14)

* = p < 0.10, ** = p < 0.05, *** = p < 0.01
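A minimal sketch of the odds-ratio arithmetic described above, using hypothetical question counts (the function and the counts are our illustration, not the report's data):

```python
def odds_ratio(correct_t, incorrect_t, correct_c, incorrect_c):
    """Odds ratio of answering correctly, treatment vs. control."""
    odds_t = correct_t / incorrect_t   # odds of a correct answer, treatment
    odds_c = correct_c / incorrect_c   # odds of a correct answer, control
    return odds_t / odds_c

# Hypothetical counts: treatment students answer 6,000 of 10,000
# questions correctly; control students answer 5,800 of 10,000.
or_hat = odds_ratio(6000, 4000, 5800, 4200)
print(f"odds ratio = {or_hat:.3f}")
```

An odds ratio of, say, 1.09 means the treatment group's odds of a correct answer are 9 percent higher than the control group's, not 9 percentage points higher.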

Matching Quality Tables: Math

The table shows the p-values from a difference-of-means t-test for each matching variable before (Pre) and after (Post) matching. The final pair of columns shows the results of a joint significance (chi-squared) test. Numbers closer to 1 indicate parity between the treatment and control groups. Variable columns: Class, Gender, Learning Level, Usage Pre-Baseline, School Type.

Treat  Ctrl   Class        Gender       Learn. Lvl   Usage Pre    School Type  Chi-Squared
(N)    (N)    Pre   Post   Pre   Post   Pre   Post   Pre   Post   Pre   Post   Pre   Post
240    340    0.00  0.15   0.41  0.21   0.00  0.73   0.00  0.34   0.31  0.83   0.00  0.48
240    345    0.00  0.09   0.35  0.15   0.00  0.68   0.00  0.17   0.32  0.78   0.00  0.18
240    350    0.00  0.38   0.35  0.07   0.00  0.58   0.00  0.28   0.33  1.00   0.00  0.45
240    355    0.00  0.15   0.39  0.32   0.00  0.53   0.00  0.09   0.33  0.42   0.00  0.19
240    360    0.00  0.04   0.40  0.45   0.00  0.76   0.00  0.07   0.30  0.31   0.00  0.08
245    340    0.00  0.57   0.41  0.37   0.00  0.44   0.00  0.16   0.24  0.54   0.00  0.56
245    345    0.00  0.34   0.34  0.48   0.00  0.51   0.00  0.10   0.25  0.66   0.00  0.44
245    350    0.00  0.02   0.35  0.13   0.00  0.63   0.00  0.07   0.26  0.26   0.00  0.01
245    355    0.00  0.01   0.39  0.16   0.00  0.99   0.00  0.17   0.26  0.42   0.00  0.02
245    360    0.00  0.10   0.40  0.24   0.00  0.66   0.00  0.18   0.24  0.29   0.00  0.21
250    340    0.00  0.30   0.46  0.65   0.00  0.74   0.00  0.17   0.19  0.47   0.00  0.59
250    345    0.00  0.25   0.39  0.75   0.00  0.69   0.00  0.37   0.20  0.58   0.00  0.66
250    350    0.00  0.06   0.40  0.58   0.00  0.92   0.00  0.11   0.21  0.66   0.00  0.14
250    355    0.00  0.07   0.44  0.52   0.00  0.93   0.00  0.11   0.21  0.87   0.00  0.14
250    360    0.00  0.08   0.45  0.24   0.00  0.87   0.00  0.20   0.19  1.00   0.00  0.26
255    340    0.00  0.29   0.46  0.50   0.00  0.61   0.00  0.09   0.20  0.40   0.00  0.34
255    345    0.00  0.28   0.38  0.85   0.00  0.41   0.00  0.18   0.20  0.51   0.00  0.34
255    350    0.00  0.50   0.39  0.52   0.00  0.60   0.00  0.14   0.21  1.00   0.00  0.73
255    355    0.00  0.52   0.44  0.89   0.00  0.67   0.00  0.17   0.21  0.66   0.00  0.78
255    360    0.00  0.25   0.44  0.52   0.00  0.72   0.00  0.14   0.19  0.79   0.00  0.47
260    340    0.00  0.13   0.35  0.93   0.00  0.59   0.00  0.23   0.19  0.62   0.00  0.19
260    345    0.00  0.17   0.28  0.68   0.00  0.67   0.00  0.22   0.20  0.38   0.00  0.24
260    350    0.00  0.18   0.29  0.82   0.00  0.53   0.00  0.07   0.20  0.79   0.00  0.18
260    355    0.00  0.13   0.33  0.68   0.00  0.83   0.00  0.22   0.20  0.29   0.00  0.29
260    360    0.00  0.65   0.33  0.93   0.00  0.23   0.00  0.16   0.19  0.35   0.00  0.25
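The Pre/Post p-values in these balance tables come from difference-of-means tests. The sketch below shows such a check on hypothetical data; it uses a large-sample normal approximation via `statistics.NormalDist` rather than the exact t distribution the report presumably used.

```python
import math
from statistics import NormalDist

def diff_means_pvalue(xs, ys):
    """Two-sided Welch difference-of-means test, normal approximation."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)   # sample variances
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical "class" values: imbalanced before matching, balanced after.
treated        = [5, 6, 5, 7, 6, 5, 6, 7, 5, 6]
control_before = [4, 5, 4, 4, 5, 5, 4, 4, 5, 4]
control_after  = [5, 6, 6, 7, 5, 5, 6, 7, 5, 6]
print("pre :", round(diff_means_pvalue(treated, control_before), 3))
print("post:", round(diff_means_pvalue(treated, control_after), 3))
```

A small pre-matching p-value (imbalance) and a large post-matching p-value (parity) is the pattern a successful match should produce, as in the tables.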

Matching Quality Tables: Language

The table shows the p-values from a difference-of-means t-test for each matching variable before (Pre) and after (Post) matching. The final pair of columns shows the results of a joint significance (chi-squared) test. Numbers closer to 1 indicate parity between the treatment and control groups. Variable columns: Class, Gender, Learning Level, Usage Pre-Baseline, School Type.

Treat  Ctrl   Class        Gender       Learn. Lvl   Usage Pre    School Type  Chi-Squared
(N)    (N)    Pre   Post   Pre   Post   Pre   Post   Pre   Post   Pre   Post   Pre   Post
240    390    0.01  0.76   0.03  0.39   0.00  0.58   0.00  0.79   0.58  0.25   0.00  0.55
240    395    0.01  0.89   0.03  0.39   0.00  0.68   0.00  0.68   0.58  0.23   0.00  0.70
240    400    0.00  0.88   0.02  0.89   0.00  0.91   0.00  0.66   0.68  0.25   0.00  0.83
240    405    0.00  0.76   0.02  0.70   0.00  0.94   0.00  0.67   0.78  0.31   0.00  0.82
240    410    0.00  0.63   0.03  0.34   0.00  0.98   0.00  0.74   0.72  0.96   0.00  0.90
245    390    0.00  0.81   0.03  0.32   0.00  0.55   0.00  0.72   0.60  0.23   0.00  0.65
245    395    0.00  0.74   0.02  0.30   0.00  0.76   0.00  0.59   0.59  0.37   0.00  0.75
245    400    0.00  0.56   0.02  0.93   0.00  0.96   0.00  0.66   0.69  0.49   0.00  0.88
245    405    0.00  0.83   0.02  0.64   0.00  0.98   0.00  0.69   0.80  0.34   0.00  0.90
245    410    0.00  0.60   0.03  0.32   0.00  0.66   0.00  0.89   0.73  0.58   0.00  0.87
250    390    0.00  0.97   0.03  0.64   0.00  0.81   0.00  0.66   0.61  0.35   0.00  0.88
250    395    0.00  0.98   0.03  0.68   0.00  0.54   0.00  0.79   0.60  0.43   0.00  0.90
250    400    0.00  0.94   0.03  0.82   0.00  0.92   0.00  0.66   0.71  0.46   0.00  0.97
250    405    0.00  0.94   0.03  0.55   0.00  0.86   0.00  0.63   0.81  0.79   0.00  0.98
250    410    0.00  0.50   0.04  0.55   0.00  0.99   0.00  0.80   0.75  0.96   0.00  0.93
255    390    0.01  0.81   0.03  0.75   0.00  0.55   0.00  0.76   0.45  0.36   0.00  0.91
255    395    0.01  0.83   0.03  0.52   0.00  0.63   0.00  0.70   0.44  0.50   0.00  0.92
255    400    0.00  0.69   0.03  0.58   0.00  0.64   0.00  0.66   0.53  0.92   0.00  0.98
255    405    0.00  0.83   0.03  1.00   0.00  0.59   0.00  0.76   0.63  0.61   0.00  0.99
255    410    0.00  0.27   0.04  0.93   0.00  0.59   0.00  0.95   0.57  0.57   0.00  0.93
260    390    0.00  0.73   0.02  0.93   0.00  0.55   0.00  0.88   0.46  0.68   0.00  0.90
260    395    0.01  0.77   0.02  0.96   0.00  0.60   0.00  0.76   0.46  0.64   0.00  0.99
260    400    0.00  0.60   0.02  0.82   0.00  0.59   0.00  0.75   0.55  0.29   0.00  0.93
260    405    0.00  0.72   0.02  0.72   0.00  0.98   0.00  0.64   0.64  0.84   0.00  0.98
260    410    0.00  0.47   0.03  0.49   0.00  0.99   0.00  0.77   0.58  0.92   0.00  0.89

Full Matching Results: Math

This matching specification starts with a sample frame of 735 and uses frequency weights of 166 control observations to match 237 treatment observations. U = unmatched, M = matched.

Variable                     U/M   Treated   Control   % Bias    t       p>|t|
Center - GP                  U     0.32      0.28        9.8     1.17    0.24
                             M     0.32      0.33       -1.4    -0.15    0.88
Center - SV                  U     0.08      0.03       24.4     3.04    0.00
                             M     0.08      0.09       -3.8    -0.33    0.74
Center - TB                  U     0.16      0.06       34.5     4.28    0.00
                             M     0.16      0.12       12.9     1.24    0.22
Center - TK                  U     0.19      0.08       32.4     3.98    0.00
                             M     0.19      0.22      -10.0    -0.91    0.37
Student's Class              U     5.43      4.61       37.7     4.49    0.00
                             M     5.43      4.92       23.3     2.62    0.01
Gender                       U     1.41      1.45       -7.3    -0.87    0.39
                             M     1.41      1.35       12.8     1.42    0.16
Math Learning Level          U     3.88      2.81       79.5     9.50    0.00
                             M     3.88      3.88       -0.2    -0.02    0.99
Minutes Usage Pre-Baseline   U     312.68    164.02     77.4     9.61    0.00
                             M     312.68    286.31     13.7     1.39    0.17
School Type                  U     0.22      0.26       -9.5    -1.12    0.26
                             M     0.22      0.26       -7.4    -0.81    0.42

Sample      Pseudo R2   LR chi2   p<chi2   Mean Bias   Median Bias   B        R      %Var
Unmatched   0.22        172.64    0.00     34.70       32.40         120.5*   1.59   25
Matched     0.03        20.36     0.02     9.50        10.00         41.7*    1.09   50

Full Matching Results: Language

This matching specification starts with a sample frame of 788 and uses frequency weights of 172 control observations to match 232 treatment observations. U = unmatched, M = matched.

Variable                     U/M   Treated   Control   % Bias    t       p>|t|
Student's Class              U     5.42      4.87       25.5     3.09    0.00
                             M     5.42      5.45       -1.4    -0.15    0.88
Gender                       U     1.41      1.49      -17.3    -2.09    0.04
                             M     1.41      1.40        1.3     0.14    0.89
Language Learning Level      U     5.25      4.03       69.8     8.51    0.00
                             M     5.25      5.23        1.0     0.11    0.91
Minutes Usage Pre-Baseline   U     408.80    146.22    113.0    15.03    0.00
                             M     408.80    397.13      5.0     0.44    0.66
School Type                  U     0.24      0.22        3.7     0.44    0.66
                             M     0.24      0.28      -11.3    -1.16    0.25

Sample      Pseudo R2   LR chi2   p<chi2   Mean Bias   Median Bias   B (15)   R       %Var
Unmatched   0.25        210.35    0.00     45.80       25.50         126.8*   3.06*   25
Matched     0.00        2.11      0.83     4.00        1.40          13.4     0.71    25

15. B and R represent Rubin's B and Rubin's R measures of bias. Matching is considered unbiased if B < 25% and R is between 0.5 and 2.
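The "% Bias" column is the standardized difference in means commonly used as a matching diagnostic (the Rosenbaum-Rubin standardized bias). The sketch below uses hypothetical variances, since the tables report means and bias but not the underlying group variances.

```python
import math

def standardized_bias(mean_t, var_t, mean_c, var_c):
    """100 * (mean_T - mean_C) / sqrt((var_T + var_C) / 2)."""
    return 100.0 * (mean_t - mean_c) / math.sqrt((var_t + var_c) / 2.0)

# Student's Class, unmatched: means 5.42 (treated) vs. 4.87 (control).
# With an assumed common variance of about 4.65 (hypothetical, chosen
# for illustration), this reproduces the table's 25.5% bias.
print(round(standardized_bias(5.42, 4.65, 4.87, 4.65), 1))
```

A common rule of thumb treats an absolute standardized bias under 10 as well balanced, which the matched rows in both tables generally satisfy.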

Full Regression Table: Math

Logistic Regression Results Table - Mathematics
Number of obs = 48,077 (result of frequency weights); unique observations = 20,313 responses
Prob > chi2 = 0.000; Pseudo R2 = 0.099
(Std. err. adjusted for 404 clusters in userid)

Variable                          Odds Ratio   P-Value   [95% Conf. Interval]
Difference-in-Difference Model
  Treatment                       1.01         0.93      0.86    1.18
  Endline                         1.44         0.00      1.22    1.69
  Treatment * Endline             1.14         0.19      0.94    1.37
Math Learning Level               0.94         0.17      0.85    1.03
Student's Class                   1.07         0.01      1.02    1.13
Tuition                           1.05         0.53      0.90    1.22
Minutes Usage Pre-Baseline        1.00         0.00      1.00    1.00
Question Difficulty               0.01         0.00      0.01    0.02
Gender (Ref - Male Students)
  Female Students                 0.88         0.06      0.77    1.01
School Type (Ref - Govt School)
  Private School                  1.09         0.39      0.90    1.32
Season (Ref - May)
  Aug                             0.87         0.11      0.73    1.03
  Nov                             0.91         0.34      0.75    1.10
  Feb                             0.89         0.16      0.76    1.04
MS Center (Ref - CP)
  GP                              1.11         0.27      0.92    1.34
  SV                              1.84         0.00      1.29    2.64
  TB                              0.91         0.55      0.68    1.23
  TK                              0.95         0.59      0.78    1.16
Constant                          3.72         0.00      2.66    5.20

Significance: * p < 0.10, ** p < 0.05, *** p < 0.01

Full Regression Table: Language

Logistic Regression Results Table - Language
Number of obs = 48,862 (result of frequency weights); unique observations = 20,714 responses
Prob > chi2 = 0.000; Pseudo R2 = 0.197
(Std. err. adjusted for 404 clusters in userid)

Variable                          Odds Ratio   P-Value     [95% Conf. Interval]
Difference-in-Difference Model
  Treatment                       1.24         0.11        0.95    1.62
  Endline                         1.76         <0.01***    1.34    2.30
  Treatment * Endline             1.22         0.16        0.92    1.62
Lang Learning Level               0.98         0.57        0.92    1.04
Student's Class                   1.11         <0.01***    1.05    1.17
Tuition                           0.96         0.63        0.81    1.14
Minutes Usage Pre-Baseline        1.00         0.062*      1.00    1.00
Question Difficulty               0.00         <0.01***    0.00    0.01
Gender (Ref - Male Students)
  Female Students                 1.09         0.27        0.94    1.27
School Type (Ref - Govt School)
  Private School                  1.75         <0.01***    1.41    2.17
Season (Ref - May)
  Aug                             0.70         <0.01***    0.58    0.84
  Nov                             0.77         <0.01***    0.64    0.94
  Feb                             0.76         0.04**      0.59    0.98
MS Center (Ref - CP)
  GP                              0.90         0.40        0.72    1.14
  SV                              0.85         0.42        0.56    1.27
  TB                              0.90         0.38        0.71    1.14
  TK                              0.81         0.12        0.63    1.05
Constant                          4.63         <0.01***    3.24    6.61

Significance: * p < 0.10, ** p < 0.05, *** p < 0.01
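To make the difference-in-difference odds ratios concrete, the sketch below recovers the four cell probabilities they imply for the language model when every categorical covariate is at its reference level and continuous covariates are at zero (the setting in which the constant's odds apply), so it should be read as illustrative only.

```python
def to_prob(odds):
    """Convert odds to a probability."""
    return odds / (1.0 + odds)

# Odds ratios from the language regression table above.
CONSTANT, TREAT, ENDLINE, INTERACTION = 4.63, 1.24, 1.76, 1.22

# In a logistic DiD, odds ratios multiply: the treated-endline cell
# combines the constant with all three treatment/time terms.
cells = {
    "control, baseline":   CONSTANT,
    "control, endline":    CONSTANT * ENDLINE,
    "treatment, baseline": CONSTANT * TREAT,
    "treatment, endline":  CONSTANT * TREAT * ENDLINE * INTERACTION,
}
for name, odds in cells.items():
    print(f"{name:>19}: P(correct) = {to_prob(odds):.3f}")
```

The interaction odds ratio (1.22) is the DiD estimate on the odds scale: it is the extra multiplicative gain in odds for treated students at endline beyond what the separate treatment and time terms predict.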

About IDinsight

IDinsight is a development consulting organization that helps policymakers and managers make socially impactful decisions using rigorous evidence. IDinsight's core service tailors experimental evaluation methodologies, including but not limited to randomized controlled trials, to the priorities of policymakers and managers. IDinsight also offers policy design consulting and scale-up support to complement evaluation activities for clients who want to maximize social impact through evidence-based decision-making. IDinsight has offices in India, Zambia, and the United States, serving government, NGO, and social enterprise clients working in education, health, nutrition, agriculture, governance, sanitation, and finance. For more information, please visit www.idinsight.org. For questions on this report, please contact Ben Brockman (Ben.Brockman@idinsight.org) or Ronald Abraham (Ronald.Abraham@idinsight.org).