Sampling and Sample Size. Shawn Cole Harvard Business School


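Calculating Sample Size
$$\text{Effect Size} = \left(t_{(1-\kappa)} + t_{\alpha}\right) \cdot \sqrt{\frac{1}{P(1-P)}} \cdot \sqrt{\frac{\sigma^{2}}{N}} \cdot \sqrt{1 + \rho(m-1)}$$
Power: $\kappa$. Significance level: $\alpha$. Variance: $\sigma^{2}$. ICC: $\rho$. Proportion in treatment: $P$. Sample size: $N$. Average cluster size: $m$.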

Let's Take a Step Back
What are we trying to do when we evaluate a program? We are trying to measure and demonstrate the existence of impact (positive or negative). One helpful way of thinking about how we approach program evaluation is an analogy from criminal law.
J-PAL SAMPLING AND SAMPLE SIZE

Burden of Proof: Proving Guilt
In criminal law, most institutions follow the rule: innocent until proven guilty. The jury or judge starts with the prior that the accused person is innocent, and the burden of proof is on the prosecutor to show that the accused person is guilty.

Burden of Proof: Demonstrating Impact
In program evaluation, instead of a presumption of innocence, the rule is a presumption of insignificance. We begin with the assumption that there is no (zero) impact of the program. The burden of proof is on the evaluator to show a significant effect of the program.

Burden of Proof: Conclusions
If it is very unlikely (e.g., less than a 5% probability) that the difference is solely due to chance, we reject the hypothesis that there is no impact of the program. We may now say: our program has a statistically significant impact.

Demonstrating Impact
Choose a confidence level at which we feel comfortable saying that an observed impact (say, a specific increase in test scores) probably did not arise purely due to chance. Typically this is 95% (equivalently, a significance level of α = 5%). This means that if the program had no impact and we ran our experiment 100 times, we would expect to observe an impact purely by chance about 5 times. This is the commonly accepted threshold for significance (don't ask us why, though!).

What is the Significance Level?
Type I Error: finding evidence of impact even if there actually is no impact.
Significance level: the probability that we will find evidence of impact when there is none.

Demonstrating Impact: 95% confidence
The truth is "Effective" but you conclude "No Effect": Type II Error (low power).
The truth is "No Effect" but you conclude "Effective": Type I Error (5% of the time).

What is Power?
Type II Error: finding no evidence of impact even though there actually is impact.
Power: if there is a measurable effect of our intervention (i.e. the program has an impact), the probability that we will detect that effect.

Type I versus Type II errors
The truth is "Effective" but you conclude "No Effect": Type II Error (low power).
The truth is "No Effect" but you conclude "Effective": Type I Error (5% of the time).

Source: https://effectsizefaq.com/category/type-ii-error/

Demonstrating Impact
Choose a confidence level (95%) at which we feel confident saying that any impact observed probably did not arise purely due to chance. Choose a power level at which we feel confident saying that if an impact was not observed, we probably did not miss it purely due to chance. Typically this is 80%. This means that if the program did have an impact and we ran our experiment 100 times, we would expect to observe no impact purely by chance about 20 times. This is the commonly accepted threshold for power (again, don't ask us why, though!).

What influences power?
In an ideal world, we could have a very high confidence level (99%) and very high power (99%), but this would typically require an extremely large sample. What are the various parameters affecting the power of a study? Which parameters are fixed and which ones can be changed? What are your constraints? Budget? Sample size? Intervention costs? Survey costs?

Power: main considerations
1. Sample Size
2. Effect Size
3. Take-up
4. Variance
5. Proportion of sample in T vs. C
6. Clustering


By increasing sample size you increase:
A. Accuracy B. Precision C. Both D. Neither E. Don't know

Intuition about sample size
We run an experiment on a sample that is a randomly selected subset of the population.
[Figure: random sampling from Population to Sample]

Intuition about sample size
The larger the sample, the more representative of the population it is likely to be.
[Figure: random sampling from Population to Samples of increasing size]

Intuition about sample size
Larger sample → more representative of the population.
Larger sample → more likely that the experiment is capturing any impact that would occur in the population.
Larger sample → minimizes Type II errors.
Larger sample → maximizes power and precision.

RULE OF THUMB NUMBER 1
A larger sample gives you more power.
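A rough way to see this rule at work is to compute approximate power at several sample sizes. The sketch below is illustrative only: it assumes a two-sided test on a difference in means, uses a normal approximation, and the function name `approx_power` and the example numbers (an effect of 0.2 standard deviations) are made up for the illustration.

```python
import math
from statistics import NormalDist


def approx_power(effect, sd, n_total, alpha=0.05, p_treat=0.5):
    """Approximate power of a two-sided test for a difference in means."""
    std_normal = NormalDist()
    # Standard error of the difference in means for the given T/C split.
    se = sd * math.sqrt(1.0 / (p_treat * (1.0 - p_treat) * n_total))
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # Probability that the estimate clears the critical value in the
    # expected direction (the tiny opposite-tail probability is ignored).
    return std_normal.cdf(abs(effect) / se - z_crit)


# Hypothetical effect of 0.2 standard deviations at growing sample sizes.
for n in (100, 400, 785, 1600):
    print(n, round(approx_power(effect=0.2, sd=1.0, n_total=n), 2))
```

Under these assumptions, power rises steadily with the total sample and reaches roughly 80% near 785 observations.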

By increasing sample size you increase:
A. Accuracy B. Precision C. Both D. Neither E. Don't know


Does a larger effect require a larger or smaller sample?
A. Larger sample B. Smaller sample C. Don't know

Some intuition
Remember, we want to identify the effect both accurately and precisely. Randomization gives us accuracy by eliminating bias. Power is about precision. Think about how precisely you could identify an object based on its size.

Which of these two images can you identify more precisely?
[Image 1] [Image 2]

Which of the two images can you identify more precisely?
A. Image 1 B. Image 2 C. I can identify both equally precisely/imprecisely

Accuracy vs. Precision
Here's the image again.

Accuracy vs. Precision: Which of these is the image from the previous slide?
[Image 1] [Image 2]

Which of these is the image from the previous slide?
A. Left B. Right C. Both

Let's take another look...
[Image 1] [Image 2]

Intuition about effect size
A larger effect is like the larger image: it allows you to reliably identify precisely what the image says. Think of sample size as allowing you to zoom in on an image. A larger image requires less zoom, i.e. a smaller sample. A smaller image requires more zoom, i.e. a larger sample.

Alternative intuition
You have to decide whether a coin is fair or unfair. You win a dollar if you are right. You can toss the coin ten times.
Nickels: the unfair nickel lands heads 9 out of 10 times.
Dimes: the unfair dime lands heads 6 out of 10 times.
Which level of unfairness is easier to detect?
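The coin example can be worked out exactly with binomial probabilities. This sketch makes some assumptions that are not on the slide: we use a one-sided decision rule ("call the coin unfair if it shows at least k heads in 10 tosses") and pick the smallest k that keeps the false-positive rate for a fair coin at or below 5%.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


N_TOSSES = 10
ALPHA = 0.05  # assumed significance level

# Smallest number of heads at which a fair coin is wrongly flagged
# no more than 5% of the time.
threshold = next(
    k for k in range(N_TOSSES + 1) if prob_at_least(k, N_TOSSES, 0.5) <= ALPHA
)

# Power: chance of flagging each unfair coin with that rule.
power_nickel = prob_at_least(threshold, N_TOSSES, 0.9)
power_dime = prob_at_least(threshold, N_TOSSES, 0.6)

print("heads needed to call the coin unfair:", threshold)
print("power against the nickel:", round(power_nickel, 3))
print("power against the dime:", round(power_dime, 3))
```

With only ten tosses, this rule catches the heavily biased nickel most of the time but almost never catches the mildly biased dime: the smaller effect needs many more tosses.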

RULE OF THUMB NUMBER 2
Effect size and sample size are inversely related for a given level of power: a larger effect requires a smaller sample; a smaller effect requires a larger sample.
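To get a feel for how steep this relationship is, here is a small sketch of the required total sample for a two-arm, individually randomized design. It assumes a two-sided test and normal critical values in place of t-values; `required_total_n` and the example effect sizes are hypothetical.

```python
import math
from statistics import NormalDist


def required_total_n(effect, sd, power=0.8, alpha=0.05, p_treat=0.5):
    """Total sample needed to detect `effect` (unrounded, no clustering)."""
    std_normal = NormalDist()
    z_sum = std_normal.inv_cdf(power) + std_normal.inv_cdf(1.0 - alpha / 2.0)
    return z_sum**2 * sd**2 / (p_treat * (1.0 - p_treat) * effect**2)


# Hypothetical effects measured in standard deviations (sd = 1).
for effect in (0.4, 0.2, 0.1):
    print(effect, math.ceil(required_total_n(effect, sd=1.0)))
```

Because the effect enters squared, halving the effect you want to detect roughly quadruples the sample you need.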

Does a larger effect require a larger or smaller sample?
A. Larger sample B. Smaller sample C. Don't know

Program Effect vs. Detectable Effect
The intervention has an effect (or not) that is independent of any study or evaluation (e.g., Head Start increases test scores by 0.3 standard deviations). When designing a study, the sample and design you choose determine the minimum detectable effect size. An experiment with enough power to detect an effect of 0.3 standard deviations will detect a larger effect with even more power. It will have less power if the true effect is less than 0.3. So, what effect size should you use when designing an experiment?

Consider the following options
A low minimum detectable effect size: more power to detect any given effect (good), but a larger sample is required (bad).
The smallest effect size at which the program is cost-effective.
The largest effect size you expect the program to have: requires a smaller sample.

What effect size should you use when designing your experiment?
A. Smallest effect size that is still cost-effective B. Largest effect size you expect your program to produce C. Both D. Neither


If you anticipate imperfect take-up, should you increase or decrease your sample size?
A. Increase B. Decrease C. Don't know

Does imperfect take-up increase or decrease your effect size?
A. Increase B. Decrease C. Don't know

Take-up and effect size
Let's walk through a numerical example. Say you have a program that increases savings by $1 for every person who takes up the program. You offer this program to 4 people in the treatment group, and have 4 people in the control group who do not receive it.

Effect size with 100% take-up
Effect size (i.e. average effect) = Treatment mean − Control mean:
$$\frac{1+1+1+1}{4} - \frac{0}{4} = \$1$$

Effect size with 50% take-up
Effect size (i.e. average effect) = Treatment mean − Control mean:
$$\frac{1+0+1+0}{4} - \frac{0}{4} = \$0.5$$

Remember RULE OF THUMB NUMBER 2?
Effect size and sample size are inversely related for a given level of power: a larger effect requires a smaller sample; a smaller effect requires a larger sample.

RULE OF THUMB NUMBER 3
Imperfect take-up will necessitate a larger sample for a given level of power. If you anticipate imperfect take-up, plan for it with a larger sample, and/or work to increase take-up.
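The savings example and its consequence for sample size can be sketched in a few lines. The assumptions here go beyond the slides and are worth stating: people who do not take up the program are unaffected, nobody in control gets the program, and required sample size scales with one over the squared effect (as in Rule of Thumb 2). Function names are illustrative.

```python
def diluted_effect(effect_on_takers, take_up):
    """Average effect across everyone offered the program."""
    return effect_on_takers * take_up


def sample_inflation(take_up):
    """Factor by which the sample must grow to keep the same power."""
    return 1.0 / take_up**2


# The slide's example: $1 gain per taker, 2 of 4 treated people take up.
treatment = [1, 0, 1, 0]
control = [0, 0, 0, 0]
observed = sum(treatment) / len(treatment) - sum(control) / len(control)

print("observed average effect:", observed)
print("diluted effect at 50% take-up:", diluted_effect(1.0, 0.5))
print("sample inflation at 50% take-up:", sample_inflation(0.5))
```

Under these assumptions, 50% take-up halves the measured effect and calls for about four times the sample.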

If you anticipate imperfect take-up, should you increase or decrease your sample size?
A. Decrease B. Increase C. Don't know


What will more variation in the underlying population do to our estimates?
A. Increase risk of bias B. Reduce risk of bias C. Increase precision of estimate D. Reduce precision of estimate E. Will not change estimates

Intuition about variance
Say our program seeks to increase child height. There is a lot of variation in height in the population. We might end up with a sample that has mostly tall individuals, or one with mostly short individuals.
[Figure: random sampling from Population to Sample]

Intuition about variance
In a population with higher variance, there is more chance that we get an unrepresentative sample. Contrast this with a population with lower variation in height: there is more chance that we end up with a representative sample.
[Figure: random sampling from Population to Sample]

Implications of higher variance
Remember, our program seeks to increase child height. There is a lot of variation in height in the population. At endline, children in treatment are taller than those in control. Is this because we happened to start with taller children, or because the program worked?
[Figure: Population, random sampling to Sample, split into Treatment and Control, program implemented]

Implications of higher variance
If everyone in the underlying population was of similar height at the start, it would be easy to sort this out. We would be more likely to get a representative sample, and the variation we see at the end between treatment and control must be due to the program.
[Figure: Population, random sampling to Sample, split into Treatment and Control, program implemented]

RULE OF THUMB NUMBER 4
For a given level of power, higher variance → larger sample needed.

When there is high variance in the underlying population, you should:
A. Reduce the population variance B. Decrease your sample size C. Increase your sample size


For a given sample size, the sample should always be equally split between treatment and control.
A. True B. False C. It depends

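Let's go back to that equation
$$\text{Effect Size} = \left(t_{(1-\kappa)} + t_{\alpha}\right) \cdot \sqrt{\frac{1}{P(1-P)}} \cdot \sqrt{\frac{\sigma^{2}}{N}} \cdot \sqrt{1 + \rho(m-1)}$$
Power: $\kappa$. Significance level: $\alpha$. Variance: $\sigma^{2}$. ICC: $\rho$. Proportion in treatment: $P$. Sample size: $N$. Average cluster size: $m$.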
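For readers who prefer code to notation, the power equation on this slide can be sketched as a small function. This is an approximation under stated assumptions, not the deck's own implementation: it uses normal critical values instead of t-values, treats the test as two-sided (so the significance term is evaluated at 1 − α/2), and all names and example inputs are illustrative.

```python
import math
from statistics import NormalDist


def minimum_detectable_effect(
    n_total, sd, power=0.8, alpha=0.05, p_treat=0.5, icc=0.0, cluster_size=1
):
    """Smallest effect detectable with the given design (normal approximation)."""
    std_normal = NormalDist()
    # (t_(1-kappa) + t_alpha), approximated with normal quantiles.
    critical = std_normal.inv_cdf(power) + std_normal.inv_cdf(1.0 - alpha / 2.0)
    allocation = math.sqrt(1.0 / (p_treat * (1.0 - p_treat)))
    noise = math.sqrt(sd**2 / n_total)
    # Design effect from clustering: 1 + rho * (m - 1).
    clustering = math.sqrt(1.0 + icc * (cluster_size - 1))
    return critical * allocation * noise * clustering


# Hypothetical design: 785 individuals, outcome sd of 1.
print(round(minimum_detectable_effect(785, 1.0), 3))
# Same total sample randomized in clusters of 60 with ICC 0.2.
print(round(minimum_detectable_effect(785, 1.0, icc=0.2, cluster_size=60), 3))
```

Each argument maps to one ingredient of the equation, so changing one at a time shows how it moves the detectable effect.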

Focus on the proportion in treatment
$$\text{Effect Size} = \left(t_{(1-\kappa)} + t_{\alpha}\right) \cdot \sqrt{\frac{1}{P(1-P)}} \cdot \sqrt{\frac{\sigma^{2}}{N}} \cdot \sqrt{1 + \rho(m-1)}$$
Here $P$ is the proportion of the sample in treatment.

Maximizing power for a given sample
P refers to the proportion of the sample in the treatment group. P is always some number between 0 and 1. To get maximum power for a given sample size, we need to minimize the term $\frac{1}{P(1-P)}$. This term is minimized when P is 0.5, i.e. when half the sample is in the treatment group.
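A quick numerical check of this claim, as a sketch: evaluate the allocation term 1/(P(1−P)) over a grid of values of P and see where it is smallest. The grid and the helper names are illustrative choices.

```python
import math


def allocation_term(p_treat):
    """The term 1 / (P (1 - P)) from the power equation."""
    return 1.0 / (p_treat * (1.0 - p_treat))


def mde_relative_to_even_split(p_treat):
    """How much larger the detectable effect is than with a 50/50 split."""
    return math.sqrt(allocation_term(p_treat) / allocation_term(0.5))


grid = [i / 100 for i in range(1, 100)]
best_p = min(grid, key=allocation_term)

print("P that minimizes the term:", best_p)
for p in (0.5, 0.4, 0.25, 0.1):
    print(p, round(mde_relative_to_even_split(p), 3))
```

The penalty for moving away from an even split is mild at first (a 25/75 split raises the detectable effect by about 15%) and grows quickly as the split becomes very lopsided.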

RULE OF THUMB NUMBER 5
For a given sample size, power is maximized when the sample is equally split between T and C.
Example: Sample (n=8) split into Treatment (n=4) and Control (n=4).

Other considerations
Adding additional subjects unequally still increases your power. It could be low cost to have a much bigger control group (if administrative data are available). However, there are diminishing returns. Power may not be the only consideration in how you split the sample. If implementation is costly, you may want a bigger proportion in control. Resources might be better spent on tracking some participants better than on having more participants. (We will talk about attrition later.)

For a given sample size, the sample should always be equally split between treatment and control.
A. True B. False C. It depends


Review: How do we randomize in our sample?

Can randomize individuals to T or C.

Can randomize clusters, e.g. neighborhoods or schools.


To achieve the same power as an individual-level randomization, a clustered design is likely to require:
A. A smaller sample size B. A bigger sample size C. The same sample size D. Don't know

Intuition behind clustering
You want to know how close the upcoming national elections will be.
Method 1: Randomly select 50 people from the entire US population.
Method 2: Randomly select 10 families, and ask five members of each family their opinion.

Is Method 1 or Method 2 going to give you a better idea of how close the election will be?
A. Method 1 (50 random people) B. Method 2 (5 members each of 10 random families)

People within a cluster may behave in similar ways.
[Figure: Population divided into clusters assigned to Control and Treatment]

People within a cluster may not behave in similar ways.
[Figure: Population divided into clusters assigned to Control and Treatment]

Intra-cluster correlation (ICC)
The intra-cluster correlation (ICC) is simply a measure of how similar individuals within a cluster are along an outcome of interest. The ICC is also known as rho (Greek symbol ρ). The ICC can be high or low. If rho=1, everyone in the cluster behaves exactly the same. If rho=0, cluster identity does not correlate with behavior.

Implications of ICC for power
For a given sample size, there is less power when randomizing by cluster (unless the ICC is zero). However, you may still need to randomize by cluster for other reasons (spillovers, logistics, etc.). There are diminishing returns to surveying more people per cluster. Usually, the number of clusters is the key determinant of power, not the number of people per cluster.

RULE OF THUMB NUMBER 6
For a given sample size, there is less power when randomizing by cluster (unless the ICC is zero).

RULE OF THUMB NUMBER 7
For a given level of power, higher ICC → larger sample needed. The higher the ICC, the better off you are increasing the sample by adding clusters rather than adding individuals to clusters.
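The trade-off between more clusters and more people per cluster can be illustrated with the design effect 1 + ρ(m − 1) from the power equation. The sketch below is a hedged illustration: the starting design (20 clusters of 30 people, ICC of 0.2) is hypothetical, and only the part of the variance that depends on the number of clusters, cluster size, and ICC is computed.

```python
import math


def relative_variance(n_clusters, cluster_size, icc):
    """Design effect divided by total sample: (1 + rho (m - 1)) / (J m)."""
    return (1.0 + icc * (cluster_size - 1)) / (n_clusters * cluster_size)


ICC = 0.2  # hypothetical intra-cluster correlation
base = relative_variance(n_clusters=20, cluster_size=30, icc=ICC)

# Two ways of doubling the total sample from 600 to 1,200 people.
more_clusters = relative_variance(n_clusters=40, cluster_size=30, icc=ICC)
bigger_clusters = relative_variance(n_clusters=20, cluster_size=60, icc=ICC)

# Ratio of the new detectable effect to the original one.
mde_ratio_more_clusters = math.sqrt(more_clusters / base)
mde_ratio_bigger_clusters = math.sqrt(bigger_clusters / base)

print("MDE ratio, doubling clusters:", round(mde_ratio_more_clusters, 3))
print("MDE ratio, doubling cluster size:", round(mde_ratio_bigger_clusters, 3))
```

In this example, doubling the number of clusters shrinks the detectable effect by about 29%, while doubling the number of people per cluster shrinks it by only about 3%.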

All dropouts live in one area. People in school live in another. College grads live in a third, etc. ICC (ρ) on education will be:
A. High B. Low C. No effect on rho D. Don't know

If ICC (ρ) is high, what is a more efficient way of increasing power?
A. Include more clusters in the sample B. Interview more people in each cluster C. Both D. Don't know

So let's recap

Summarizing the rules of thumb
1. Larger sample → more power
2. Smaller effect size → larger sample size required
3. Lower take-up → larger sample size required
4. High variance in population → larger sample required
5. For a given sample, an equal T and C split maximizes power
6. For a given sample, clustering → lower power
7. Higher ICC → larger sample required

Calculating power in practice: finding the ingredients for the power equation

Calculating power: A step-by-step guide
1. Set desired power (80%) and significance level (5%, i.e. 95% confidence).
2. Calculate residual variance (and ICC if clustering) using pilot data, national data sources, or data from other studies.
3. Decide the number of treatments.
4. Set the minimum detectable effect size for T vs. C and between treatments.
5. Decide the allocation ratio.
6. Calculate sample size.
7. Estimate the resulting budget.
8. Adjust the parameters above (e.g. cut the number of arms).
9. Repeat!

Residual Variance
Estimate variance using data from similar populations. Population variance is what it is; there is not much we can do about it. Knowing it can help distinguish whether an endline height difference is due to the program increasing height or to the chance allocation of tall children to treatment at baseline.
[Figure: Population split into Treatment and Control]

Residual variance: intuition
Some part of endline height can be explained by baseline height. Accounting for this allows a more precise estimate of the intervention effect.
[Figure: Control and Treatment heights at Baseline and Endline]

Residual variance
Variance reduces power because of the risk that we might pick very successful people in one group (e.g., treatment). Some variation can be explained by observables (e.g., older kids are taller). Using controls in the analysis soaks up variance → impact is more precisely estimated → more power. Calculate residual variance by regressing the outcome on controls in existing data. The baseline value of the outcome is a good control. In Stata's power command, you can adjust for multiple rounds of data; this needs an estimate of the correlation in the outcome between rounds.
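As a rough illustration of why a baseline measure helps, the sketch below uses a standard textbook relationship that is not stated on the slide and should be read as an assumption: when the outcome is linearly adjusted for a single baseline covariate that has correlation r with it, the residual variance is the original variance times (1 − r²). Names and numbers are hypothetical.

```python
def residual_variance(outcome_variance, baseline_corr):
    """Variance left after linearly adjusting for one baseline covariate."""
    return outcome_variance * (1.0 - baseline_corr**2)


def sample_needed_share(baseline_corr):
    """Sample needed with the control, as a share of the sample without it."""
    return residual_variance(1.0, baseline_corr)


# Hypothetical correlations between baseline and endline outcome.
for r in (0.0, 0.3, 0.5, 0.8):
    print(r, round(sample_needed_share(r), 2))
```

Under this assumption, a baseline measure correlated 0.5 with the endline outcome cuts the required sample by a quarter, and the gain rises sharply for stronger correlations.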

Estimating ICC (ρ)
The ICC must be between 0 and 1. It depends on the context and the outcome variable. Estimating rho requires large samples. Check the sensitivity of your power and sample size calculations to different possible values of the ICC.

Some values of ICC observed in earlier studies
Malawi: Households produce maize: 0.003
Sierra Leone: Households produce cocoa: 0.57
Sierra Leone: Average rice yields: 0.04
Busia, Kenya: Math and language test scores: 0.22
Busia, Kenya: Math test scores: 0.62
Mumbai, India: Math and language test scores: 0.28
Texas: Precinct voting preferences: 0.20
Italy: Hospital admissions: 0.06
United States: Weight and cholesterol levels: 0.02
United States: Reading achievement scores: 0.22

Number of treatment arms
Different treatment arms help disentangle different mechanisms behind an effect. More arms → larger sample required. Say we have two treatment arms. We need to think about two values of the minimum detectable effect size: one for comparing T vs. C, and one for comparing T1 vs. T2.

Multiple treatment arm tips
It is good to have at least one intensive arm, where a zero would be interesting. In the analysis, it may be useful to pool all treatment arms, creating an "any treatment" arm. You have to scale up the sample size given by statistical packages.
Stata gives sample per cell: 2 treatments + C = 3 cells.
Optimal Design assumes one treatment and one control: divide the result by 2 to get the sample size per cell.

Unequal allocation ratio
If the budget covers treatment and evaluation, it is more expensive to add one person to T than to C. An unequal allocation ratio gives a bigger total N, which may be worth it. Try different allocation ratios within a given budget and explore the tradeoffs. With multiple treatments, put more sample behind the most important question. If you are going to pool treatments, have a bigger control group. If you really care about between-treatment differences, you need a bigger sample in the treatment groups.
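One way to "try different allocation ratios within a given budget" is a simple grid search. The sketch below assumes a fixed budget and constant per-person costs, and all cost figures are invented for illustration; it compares designs only up to a constant, using the fact that the detectable effect is proportional to the square root of 1/(P(1−P)N).

```python
import math


def relative_mde(p_treat, budget, cost_treat, cost_control):
    """Detectable effect (up to a constant) for a budget-constrained design."""
    # Total sample the budget can afford at this allocation.
    n_total = budget / (p_treat * cost_treat + (1.0 - p_treat) * cost_control)
    # MDE is proportional to sqrt(1 / (P (1 - P) N)); constants are dropped.
    return math.sqrt(1.0 / (p_treat * (1.0 - p_treat) * n_total))


# Hypothetical numbers: a treated person costs 4 times as much as a control.
BUDGET, COST_TREAT, COST_CONTROL = 10_000.0, 4.0, 1.0

grid = [i / 100 for i in range(1, 100)]
best_p = min(grid, key=lambda p: relative_mde(p, BUDGET, COST_TREAT, COST_CONTROL))

print("best share in treatment:", best_p)
print("MDE vs. even split:",
      round(relative_mde(best_p, BUDGET, COST_TREAT, COST_CONTROL)
            / relative_mde(0.5, BUDGET, COST_TREAT, COST_CONTROL), 3))
```

With these made-up costs, putting about a third of the sample in treatment beats an even split, because the cheaper control observations buy a larger total N.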

Unequal allocation: example
(1) Basic: gender attitudes, literacy education, health information. N = 76
(2) Livelihoods: Basic plus financial literacy. N = 77 (half, 39, receive savings club)
(3) Full: Livelihoods plus oil. N = 77 (half, 38, receive savings club)
(4) Oil: oil incentive for unmarried girls. N = 77
(5) Control. N = 153
Total. N = 460
(6) Savings cross cut: savings club cross cut with Full and Livelihoods. N = 77

Minimum detectable effect size (MDE)
The most important ingredient for calculating power. The MDE is not the effect size we expect or want. The MDE is the effect size below which we may not be able to distinguish the effect from zero, even if it exists, i.e. below which the effect might as well be zero. Remember, you will be powered for effects larger than the MDE!

Questions to ask when determining MDE
Below what effect size would the program not be cost-effective? How big an effect would we need to make the result interesting? Do we want a smaller MDE between arms than between T and C? A common mistake is powering on T vs. C and so not having the power to distinguish between arms.

Calculating power in Stata
Stata has a new command, power, but it does not allow for clustering (yet). Most still use sampsi and sampclus, or clustersampsi (add-ons). Defaults: power 90%, significance 5%, equal T & C split.
Example (no clustering): to detect an increase in average test scores from 43% to 45%, with power of 80%:
    sampsi 0.43 0.45, power(0.8) sd(0.05)
Stata gives N per cell, e.g. N1=99. With multiple arms, you need to multiply by the number of cells (i.e. number of treatments plus control). For binary outcomes, the SD is determined by the mean.
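The per-cell figure in this example can be cross-checked by hand. The following Python sketch is not a reimplementation of sampsi; it applies the usual normal-approximation formula for comparing two means with equal group sizes and a two-sided 5% test, which are assumptions on our part. The three-cell total is an illustrative extra.

```python
import math
from statistics import NormalDist


def n_per_arm(mean_control, mean_treat, sd, power=0.8, alpha=0.05):
    """Observations per arm to detect the difference in means (normal approx.)."""
    std_normal = NormalDist()
    z_sum = std_normal.inv_cdf(power) + std_normal.inv_cdf(1.0 - alpha / 2.0)
    delta = mean_treat - mean_control
    # Equal-sized arms: n = 2 * (z_power + z_alpha/2)^2 * sd^2 / delta^2.
    return math.ceil(2.0 * z_sum**2 * sd**2 / delta**2)


per_cell = n_per_arm(0.43, 0.45, sd=0.05)
print("N per cell:", per_cell)
# Hypothetical design with 2 treatments plus control = 3 cells.
print("Total for 3 cells:", 3 * per_cell)
```

Under these assumptions the calculation gives 99 per cell, in line with the N1=99 quoted on the slide.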

Power in Stata with clustering
First calculate the sample size without clustering, and then add information on clusters.
Example: to detect an increase in average test scores from 43% to 45%, with power of 80%, randomized at class level with 60 per class and ICC of 0.2:
    sampsi 0.43 0.45, power(0.8) sd(0.05)
    sampclus, obsclus(60) rho(0.2)
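The logic of the second step can be sketched by inflating the unclustered sample with the design effect 1 + ρ(m − 1). This is a hand calculation under that assumption, starting from the 99 per arm of the unclustered example; it is not a reproduction of sampclus, whose rounding conventions may differ.

```python
import math


def clustered_n_per_arm(n_unclustered, cluster_size, icc):
    """Inflate an unclustered per-arm sample by the design effect."""
    design_effect = 1.0 + icc * (cluster_size - 1)
    return math.ceil(n_unclustered * design_effect)


N_UNCLUSTERED = 99   # per arm, from the unclustered example
CLASS_SIZE = 60
ICC = 0.2

n_clustered = clustered_n_per_arm(N_UNCLUSTERED, CLASS_SIZE, ICC)
classes_per_arm = math.ceil(n_clustered / CLASS_SIZE)

print("students per arm:", n_clustered)
print("classes per arm:", classes_per_arm)
```

With 60 students per class and an ICC of 0.2, the design effect is 12.8, so the 99 students per arm become well over a thousand, spread across about 22 classes per arm.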

Power with Optimal Design (OD)
OD: free software specifically designed for power calculations. The MDE must be entered as a standardized effect size (i.e. effect size divided by standard deviation). OD allows multiple levels of clustering and works with dropdown menus; see the J-PAL exercise for details.

Thank you!