At Evidence Action, we do not typically measure final impacts when we implement a program at scale. By “impacts” we mean the metric of ultimate interest – the real reason we are doing what we are doing. We don’t measure whether households with Dispensers have less diarrhea or lower child mortality. We don’t measure whether children who are dewormed attend school more or have better cognitive scores.
We measure whether people use chlorine and whether worm infection levels fall. These are good things, but they are steps along the way toward the stuff that we actually care about.
Measuring “means” rather than “ends” could be a controversial stance in an NGO community where M&E teams pride themselves on always measuring “impact.”
We think we are doing the right thing. Here’s why:
1) All of our programs have their genesis in rigorous studies that show the causal relationship between an intervention and the desired final impact (improved health or education, for example) on a population;
2) We engage in an intensive process for all programs considered for scale where we assess whether a desired impact of an intervention transfers to new contexts and a larger scale; and
3) There is no credible way to measure impact at scale since the entire beneficiary population should have access to the program, which leaves no comparison group without the intervention for accurate evaluation.
Stay with us as we explain this further.
1. The Evidence: Every program at Evidence Action begins with at least one, and often more than one, rigorous study – typically the gold standard of impact evaluation, a randomized controlled trial (RCT). We also conduct an in-depth literature review that identifies supporting evidence across disciplines such as medicine, behavioral economics, and anthropology.
The RCTs rigorously demonstrate a causal relationship between an intervention and the impact. For instance, a number of RCTs show that deworming children improves their health, cognitive abilities, and time in school.
2. Beta: Once there is sufficient evidence from experimental studies and the literature, an intervention enters a stage we call ‘Beta,’ where we take a close look at whether this promising intervention can be turned into a program that benefits millions. Here we grapple with the complexities of the real-world environment and evaluate important requirements for scale.
We seek to answer critical questions such as: will the impact hold at scale, are there additional locations where the intervention will work, and have we designed a cost-effective service delivery model?
We examine and pressure-test the external validity of the evidence. External validity is, in essence, the extent to which the results of a rigorous study can be generalized to diverse situations and people. For example, the original studies on the impact of deworming on children were done in Kenya. Do they hold true in India? In Vietnam? In the case of deworming, this is fairly simple since human physiology works the same way – worms rob children of nutrients, stunt growth, and reduce cognitive ability no matter where they are in the world.
However, the question of external validity may be better answered through an experimental approach when we are talking about a behavioral rather than a physiological intervention. One example is an innovative sex education curriculum focused on risk reduction, which we are now rigorously testing in Botswana after the original study was done in Kenya.
In answering the questions above, we may generate data through desk-based research, field-based research, and additional RCTs. Measuring impact at this stage is critical because our strategic questions depend on a clear understanding of causal relationships and are best answered with data. During Beta we are forthright that we will only scale those interventions that have the desired impact, do not have harmful unintended consequences, and for which we can determine sustainable and resilient business models. If our findings in Beta suggest that walking away from the program is better than continuing to scale an intervention, we will take that step.
We are currently evaluating a number of interventions in Beta and are transparent about what we are finding.
3. Program at Scale: The very definition of scale is that the intervention – let’s say for argument’s sake, deworming children – should be available to all eligible beneficiaries. This is true in Kenya, India, and Ethiopia, each of which has rolled out a national deworming program for all children in school. Without a comparable control group – a group that does not receive the deworming treatment – any impact measurements would not be accurate and would therefore be a waste of resources.
Even if we had a meaningful control group, there are ethical implications to subjecting people to experimentation and to spending limited resources on proving an impact that already has a sufficient evidence base. Martin Ravallion compellingly outlines these ethical issues in this blog post. If an intervention has been rigorously proven to be beneficial and resources for that intervention are available, then there is a compelling case to allocate these resources to reaching additional beneficiaries rather than spending on more experiments with little new knowledge to be gained. This is especially salient to us when we think about all the areas where we know so little about what is effective and cost-effective.
Furthermore, the ethical argument for RCTs is often that there are not enough resources to reach an entire population, and thus it is ethical to randomize who receives treatment. However, when there are resources to reach an entire population, withholding treatment from beneficiaries is questionable.
In our field there is much confusion about “impact measurement.” This term is often used to describe what is actually the collection of process and performance data. We do not measure impact when we scale up a program to reach millions because we have sufficient evidence to know that there is a causal relationship between the intervention and the impact.
We do measure many other data points, however: process and performance data that allow us to make decisions about how the program is being implemented.
Think of process data as measuring the quality of how we do our job for a given program. For programs at scale like the Deworm the World Initiative, we measure how well the service was delivered to millions of schools, for example. Were enough deworming drugs procured, and did they arrive on time? Were teachers trained, and did they know how to administer the drugs? Were parents informed and aware? Were health professionals present?
We also measure performance data to understand whether we are reaching the right people. We count how many children were reached by the program to ensure that we have adequate coverage rates, and we check whether, as a result of having reached sufficient coverage, worm prevalence and infection rates decrease over time. These numbers do not measure the impact on the intended outcome, but they act as a proxy for whether the impact was achieved, based on the original evidence and tested delivery models.
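As a rough illustration only, the two performance indicators described above (coverage and worm prevalence over time) could be computed along these lines. This is a minimal sketch, not a description of our actual monitoring systems: the `RoundData` fields, the `summarize` helper, the 0.75 coverage threshold, and all of the numbers are hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RoundData:
    """Hypothetical monitoring data for one treatment round."""
    children_targeted: int   # eligible children the round aimed to reach
    children_treated: int    # children recorded as having received treatment
    samples_tested: int      # children tested in a prevalence survey
    samples_positive: int    # tested children found to be infected


def coverage_rate(r: RoundData) -> float:
    """Share of targeted children who were actually treated."""
    if r.children_targeted <= 0:
        raise ValueError("children_targeted must be positive")
    return r.children_treated / r.children_targeted


def prevalence(r: RoundData) -> float:
    """Share of surveyed children who tested positive."""
    if r.samples_tested <= 0:
        raise ValueError("samples_tested must be positive")
    return r.samples_positive / r.samples_tested


def summarize(rounds: List[RoundData], min_coverage: float = 0.75) -> List[Dict]:
    """Report coverage and prevalence for each round, in order.

    min_coverage is an illustrative threshold, not a figure from the text.
    prevalence_fell compares each round with the one before it and is
    None for the first round, which has nothing to compare against.
    """
    report = []
    previous = None
    for r in rounds:
        current = prevalence(r)
        report.append({
            "coverage": coverage_rate(r),
            "coverage_adequate": coverage_rate(r) >= min_coverage,
            "prevalence": current,
            "prevalence_fell": None if previous is None else current < previous,
        })
        previous = current
    return report


# Made-up numbers for two consecutive rounds.
example_rounds = [
    RoundData(children_targeted=1000, children_treated=800,
              samples_tested=200, samples_positive=60),
    RoundData(children_targeted=1000, children_treated=900,
              samples_tested=200, samples_positive=30),
]
example_report = summarize(example_rounds)
```

Note that nothing here involves a comparison group: the sketch tracks delivery and a proxy trend, which is exactly the distinction between performance data and an impact estimate.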
Another good example of what we measure, and why, for a program at scale is highlighted in this recent blog post on Dispensers for Safe Water.
The development community spends millions on programs with little rigorous evidence that the work actually achieves its stated goals. We believe an intervention must be rooted in solid evidence and then further tested to ensure that it works as intended at scale. A program that we decide to implement for millions has met requirements sufficient to leave us confident of its impact, so we do not need to rely on weak forms of measurement.
When a program has reached that stage, we measure to know that we actually reach the people we intend to reach, and do so with a high-quality, sustainable, and resilient program.