Trials and Tribulations: Relevance Beyond the Poverty Lab

BY COURTNEY HAN

The early 2000s was a heady time to be a researcher in Busia, Kenya. The town along Kenya’s western border was packed with young aspiring economists sharing group houses, waiting for roast meat at Chauma, the local eatery, and practicing Kiswahili with their Kenyan host families. They worked on projects ranging from subsidized school uniforms to public finance decentralization, health, agriculture and banking.[1] There was even a study on the role of rubber boots in reducing worm infestation.[2] For the next decade, Busia would serve as the training ground for a new generation of development economists, some of whom have gone on to become highly respected economists at top universities and research institutions.

These researchers, sometimes called the “randomistas,”[3] were practicing a new method within development economics called the randomized control trial (RCT). Also referred to as field trials, field experiments, or program and impact evaluations, the RCT method is akin to pharmaceutical drug trials. But instead of taking place in a controlled laboratory setting, RCTs take place in real communities. The method randomly assigns an intervention—free textbooks, for example—to some study participants, and not to others. Since all participants have an equal chance of receiving the intervention, they are, on average, the same across all observed and unobserved characteristics. At the end of the trial, differences in outcomes between the two groups can be causally attributed to the intervention itself and not just innate differences between the groups, thereby resolving the identification problem endemic to social science research.
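The logic of random assignment can be illustrated with a small simulation. The sketch below is not drawn from any study discussed in this article: the sample size, the unobserved "ability" trait, the noise levels, and the true effect of 5 points are all invented assumptions. It only shows that when treatment is assigned by chance, the two groups look alike on a trait the researcher never measures, so the difference in average outcomes lands close to the true effect.

```python
import random
import statistics


def simulate_rct(n=20000, true_effect=5.0, seed=0):
    """Simulate one hypothetical trial; every number here is an illustrative assumption."""
    rng = random.Random(seed)

    # An unobserved baseline trait (say, prior ability) that also drives the outcome.
    ability = [rng.gauss(50.0, 10.0) for _ in range(n)]

    # Random assignment: shuffle participant ids and treat the first half.
    ids = list(range(n))
    rng.shuffle(ids)
    treated = set(ids[: n // 2])

    # Outcome = baseline trait + treatment effect (if treated) + random noise.
    outcome = [
        ability[i] + (true_effect if i in treated else 0.0) + rng.gauss(0.0, 5.0)
        for i in range(n)
    ]

    def group_mean(values, in_treatment):
        # Average of `values` over the treatment group (True) or control group (False).
        return statistics.mean(
            v for i, v in enumerate(values) if (i in treated) == in_treatment
        )

    return {
        # Should be near zero: randomization balances the unobserved trait.
        "baseline_gap": group_mean(ability, True) - group_mean(ability, False),
        # Should be near true_effect: simple difference in mean outcomes.
        "estimated_effect": group_mean(outcome, True) - group_mean(outcome, False),
    }


if __name__ == "__main__":
    result = simulate_rct()
    print(f"Baseline gap between groups: {result['baseline_gap']:.2f}")
    print(f"Estimated effect (true value 5.0): {result['estimated_effect']:.2f}")
```

With a much smaller sample, the baseline gap and the estimate would bounce around more from one run to the next, which is one reason well-powered trials are expensive.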

Causal attribution can only be made if an RCT study is well designed and implemented. Achieving this kind of methodological rigor is expensive, with studies averaging between a quarter[4] and half a million dollars, if not more.[5] They are also time-consuming. Each study requires approval from an ethical review board, well-trained surveyors, a strong understanding of the local context, and an implementing partner’s agreement to stay faithful to the initial design of the intervention for the duration of the study. These conditions result in a high fixed cost to run a successful RCT study, and help to explain why, in the early 2000s, many researchers chose to conduct their work in western Kenya, which already had the initial infrastructure in place, thanks to Harvard economist Michael Kremer.

Kremer, who is credited with pioneering the method in the international development field, first went to western Kenya in 1994 to visit a friend who was working for Dutch nonprofit International Child Support (ICS).[6] He was curious about how ICS’s free textbooks project affected test scores, so with the approval of ICS, Kremer set up a small experiment in which he randomly selected seven schools to receive free textbooks and seven that would not.[7] Kremer went on to study many other ICS projects in subsequent years, including one on distributing deworming pills for children in primary school.[8]

As the first Kenyan students were receiving their pills in January 1998, 26-year-old Esther Duflo was putting the finishing touches on her economics dissertation at the Massachusetts Institute of Technology (MIT). Like Kremer, Duflo was interested in using real-world data to answer causal questions in development, and she soon teamed up with him in western Kenya to experiment with fertilizer sales for farmers.[9]

By 2004, the duo had made significant progress. Kremer’s deworming paper was published in the journal Econometrica to widespread acclaim.[10] Duflo had already turned down tenure offers at Princeton and Yale, and instead created a home for RCT studies at MIT in what is known today as the Abdul Latif Jameel Poverty Action Lab (J-PAL).[11] Duflo envisioned J-PAL as a network singularly devoted to field experiments, where practitioners could collaborate and share resources, knowledge, and costs.[12]

RCTs marked a significant departure from the prevailing development trends at the time. There was broad disillusionment with the indiscriminate application of the Washington Consensus, a cocktail of deregulation, privatization and trade liberalization reforms that the International Monetary Fund (IMF) and World Bank served to crisis-ridden developing countries.[13] The 300 billion dollars of aid money distributed across the African continent since 1970 had not increased economic growth.[14] Rather, average five-year GDP growth rates had fallen 4 percent between 1970 and 1995, and the percentage of people who lived on less than $1 per day climbed from 48 percent to 60 percent over the same time period.[15] The prevailing development prescription seemed in need of an upgrade.

By institutionalizing RCTs through J-PAL, Duflo married science with practicality and brought the fresh rigor of econometrics to pressing world problems. In doing so, she recast what Victorian philosopher Thomas Carlyle called the “dismal science” of economics as the arbiter of what does and doesn’t work in development.[16] The bite-sized, media-friendly, scientifically sound findings were immensely palatable to an aid industry starved for accountability. Perhaps not surprisingly, RCTs caught on like wildfire.

From Laboratory to a Revolution

By J-PAL’s 10th anniversary in 2013, randomized trials had become “the most popular technique in development economics,” according to Shanta Devarajan, World Bank Chief Economist for North Africa and the Middle East.[17] Duflo herself appeared at the World Economic Forum, the UN General Assembly and the prestigious Collège de France in her native France, after which the British newspaper The Independent elevated her to the same stratosphere as Voltaire, Rousseau, and Sartre by crowning her the “new face of Left Bank intellectualism.”[18] She received a MacArthur genius grant in 2009, followed by the John Bates Clark Medal, a “baby Nobel,” for the best economist in America under 40.[19]

As of January 2016, Duflo’s J-PAL network has grown to over 100 researchers and 698 projects, and has established a reputation as the “gold standard” of research.[20] RCT centers now exist at U.C. Berkeley, the University of Chicago, and Harvard University. The study of RCTs has also spawned a nonprofit, Innovations for Poverty Action (IPA); a major donor called the International Initiative for Impact Evaluation (3ie); and a new division at the World Bank called the Development Impact Evaluation Initiative (DIME).[21]

In their short life, RCTs have changed how donors think, and consequently, what they fund. The UK Department for International Development (DFID), the US Agency for International Development (USAID), the World Bank, and the Global Innovation Fund, to name only a few, have integrated evidence-based learning and evaluations into their programs.[22] The movement towards evidence has resulted in what Levine and Savedoff call a paradigm shift away from “input/output . . . that tracks resources used and deliverables produced to judge success towards one centered around outcomes and causal attribution.”[23]

The revolution has cost a lot of money—and ink. Evaluations funder 3ie estimates over 2,400 impact evaluations have been published since 2000.[24] If the average RCT costs half a million dollars, that’s $1.2 billion worth of development money spent on evaluations in fifteen years. Major newspapers and journals continue to praise the RCT revolution, with articles lauding the method continuing to appear regularly in publications like The New York Times[25] and The Economist.[26]
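The back-of-envelope figure above is easy to reproduce. The short sketch below uses only numbers already cited in this article (2,400 evaluations, and an average cost between a quarter and half a million dollars); the variable and function names are my own, and the calculation inherits the text's simplifying assumption that every published impact evaluation cost as much as an average RCT.

```python
# Figures taken from the article; the names are illustrative only.
EVALUATIONS_SINCE_2000 = 2400   # 3ie estimate of published impact evaluations
COST_LOW = 250_000              # lower end of the cited average cost, in dollars
COST_HIGH = 500_000             # upper end of the cited average cost, in dollars


def total_cost(n_studies, cost_per_study):
    """Total spending if every study costs the same amount."""
    return n_studies * cost_per_study


if __name__ == "__main__":
    low = total_cost(EVALUATIONS_SINCE_2000, COST_LOW)
    high = total_cost(EVALUATIONS_SINCE_2000, COST_HIGH)
    print(f"Lower estimate: ${low / 1e9:.1f} billion")
    print(f"Upper estimate: ${high / 1e9:.1f} billion")
```

Even at the lower end of the cost range, the implied total is well over half a billion dollars.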

Dissidents of the Revolution

The RCT revolution is not without critics. One persistent concern is external validity, or how relevant one study’s results are to other contexts. Randomistas claim that some degree of external validity exists, though to what extent is unknown, and this unknown has often become a justification for more RCTs. RCTs also cannot address complex macroeconomic policy (one can’t randomly assign fiscal policy to different countries), cannot test questions that may cause respondents harm (randomizing pregnancy, for example, or smoking), and cannot be used in crisis conditions, when the moral consequences of withholding supplies outweigh the gains of learning.

Several particularly vocal skeptics include New York University economist William Easterly,[27] and Harvard economists Ricardo Hausmann and Lant Pritchett. Hausmann argues that evaluations are unimaginative, and “minimize the design possibilities [of a program].”[28] He likens randomistas to “auditors in charge of the design department . . . who have sacrificed learning on the altar of identification.”[29] Pritchett, a self-proclaimed “early non-adopter” of the method, has a similar view that RCTs hold back progress by imposing rigid protocols that stifle the tinkering that effective programs should undertake to improve.[30] Others, like Harvard economist Dani Rodrik, find the randomistas’ preoccupation with micro-scale projects to be limited in impact, when other issues, such as the right industrial policy, have far greater power to reduce poverty.[31]

Perhaps the most famous dissident is Angus Deaton, the Princeton economist and 2015 recipient of the Nobel Prize in Economics. Deaton lampooned RCTs in a 2008 presentation with one PowerPoint slide. On the left was a picture of a person in the sky with a parachute, and on the right was a person jumping out of a plane without a parachute. The pictures were labeled “Esther Duflo” and “Abhijit Banerjee,” Duflo’s former advisor, frequent collaborator, and now husband.[32] The point was harsh: We don’t need to test that parachutes are useful for people who jump from planes. In other words, why use an RCT to test a question that can be answered by theory? Learning that an English textbook won’t help a non-English-speaking second-grader improve his reading score doesn’t require two years of rigorous evaluation. Some common sense would do, and it would be a lot cheaper.

Problems of Replicability and Adoption

For the most part, critics have not managed to quell the media frenzy over RCTs, but fifteen years of studies help to bring some perspective. At J-PAL’s 10th anniversary, presenters celebrated several examples of programs that have achieved large-scale success as a result of evaluation. Kremer’s deworming study, for example, has led to over 90 million treated children across Kenya and India.[33] Safe Water Dispensers, initially a chlorine water dispenser study, now serve clean water to 4.1 million people in Kenya, Malawi, and Uganda.[34]

While RCTs can unearth some consistent evidence of what works in development, such evidence may be the exception rather than the rule. Replication studies and meta-analyses have not found a compelling case for the generalizability of results.[35] One recent meta-study by economist Eva Vivalt finds that outcomes across similar studies have differed dramatically; after grouping results by intervention category, the average variation[36] in outcomes was four times the size of the same measure of variation in medical trials.[37] Vivalt also finds that one study’s outcome was a very poor predictor[38] of outcomes for the same type of program in a different context, and that it is off by 45 percent in predicting the results of a replication in the same setting.[39] Perhaps it is not a big surprise that out of 698 evaluations, the J-PAL website lists only 15 cases of scale-up after an intervention.[40] That’s a 2 percent success rate, paltry even by international development’s standards.
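The two measures in this paragraph, the coefficient of variation defined in the endnotes and the scale-up rate, are simple enough to sketch. In the example below the list of effect sizes is entirely hypothetical, chosen only to show the calculation; it is not Vivalt's data. Whether her measure uses a population or a sample standard deviation is not stated here, so the use of the population version is also an assumption.

```python
import statistics


def coefficient_of_variation(effects):
    """Standard deviation divided by the mean of a set of study results.

    Uses the population standard deviation; that choice is an assumption.
    """
    return statistics.pstdev(effects) / statistics.mean(effects)


def scale_up_rate(scaled, total):
    """Share of evaluations that were followed by a scale-up."""
    return scaled / total


if __name__ == "__main__":
    # Hypothetical effect sizes for one intervention category (not real data).
    hypothetical_effects = [2, 4, 4, 4, 5, 5, 7, 9]
    cv = coefficient_of_variation(hypothetical_effects)
    print(f"Coefficient of variation (hypothetical data): {cv:.2f}")

    # Figures cited in the article: 15 scale-ups out of 698 evaluations.
    rate = scale_up_rate(15, 698)
    print(f"Scale-up rate: {100 * rate:.1f} percent")

    # Ratio of the cited social science figure (1.9) to the top of the
    # cited medical range (0.5), which is roughly the "four times" in the text.
    print(f"Ratio to upper medical range: {1.9 / 0.5:.1f}")
```

A coefficient of variation near 2 means the spread of results across studies is about twice the size of the average result, which is why a single study says little about the next one.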

Perhaps more troubling is the fact that RCTs, by virtue of their design, conflate interventions with their implementation when measuring impact. This can be problematic if RCT studies struggle to distinguish between a good intervention idea that is poorly implemented and a well-implemented but bad idea, as both can yield the same outcome. For instance, in 2012, Duflo and others found that short-term contract teachers in western Kenya were able to raise test scores more than civil service teachers.[41] But when the Kenyan Ministry of Education scaled up the contract teacher program, there was no improvement. Another group of researchers replicated the Duflo study, using a nonprofit to run a contract teachers program alongside the government’s national program. Contract teachers became effective again, but only under the study’s nonprofit implementing partner. The intervention, the context, and the timing were identical, but varying just one dimension, the implementer, changed the final outcome.[42] Good intervention ideas are necessary, but not sufficient, for impact. Implementation is critical too.

If external validity is a tenuous supposition, RCTs may have at least helped organizations revise their operations in cases with strong internal validity. But this appears unsupported by the evidence. Over thirty-five projects have taken place in western Kenya, and nearly every household in the region has been part of a study at some point in time.[43] Since internal validity is the first-order priority of an RCT, western Kenya should be the ideal context for any one of the thirty-five studies to be scaled up or applied locally. For the most part, however, interventions found to have significant positive effects in Busia have not been scaled. Carolyn Nekesa, a Busia resident and longtime collaborator on Kremer and Duflo’s studies, noted that while RCTs have contributed to the knowledge base of the region, there is little evidence of lasting intervention presence.[44]

One explanation for this absence of lasting impact is that researcher agendas do not align well with those of implementers. In Busia in particular, many studies were designed and run entirely by academics, leaving little room for scale-up because there was no partner to carry the implementation forward after the study ended. The protracted lag between research and publication, which averages 4.17 years, also reduces the relevance of results.[45]

Another major bottleneck to implementation is political. Kenya’s 2013 devolution led to the establishment of county government posts that have not received clear directives on the extent of their power.[46] This has affected which level of government a persistent researcher must reach to disseminate findings. For example, staff from a 2013 Busia study on rural micro-grid financing schemes secured a meeting with the National Electrification Strategy Committee, the Kenyan agency in charge of rural electrification. But due to the piecemeal process of devolution, substantive county-level planning is on hold until electricity provision is successfully devolved.[47]

Political constraints are one of many administrative, practical, and personal constraints that implementers optimize across on a daily basis. Is this policy relevant to my constituents’ needs? Is this policy politically feasible? Will I lose my job if I propose this policy? Will I have to do much more work if this policy succeeds? The optimal answers to such questions do not naturally align with the economic equilibrium that is the randomistas’ mono-focus. Thus in the context of the practical complexities of policymaking, it is not so unreasonable that supremely well-identified and quantifiable solutions are rarely applied, even in places where interventions have high internal validity.

Such practical realities help to explain why the implementation of RCT findings has underperformed compared to expectations. More importantly, practical realities are weakening a critical assumption that underpins the RCT enterprise as a whole, which is that the problems of poverty stem from a lack of precise answers. The RCT enterprise assumes that once concrete evidence is found, implementers will use such evidence effectively to reduce poverty. However, the evidence is showing, through a lack of both scale-up and adoption, that this assumption is not necessarily true.

The Next Generation

In fifteen years, RCTs have generated tremendous influence by moving the development paradigm from the rhetoric of hope to one of evidence. Organizations like J-PAL have added thousands of rigorous evaluations to the development knowledge base. But the resulting body of research shows mixed replicability of results and low adoption by the policy community. If the economist has not replaced the work of the development practitioner, and data is not solving the problem of poverty, where does this leave J-PAL and its affiliates?

The development consultancy IDinsight recently introduced a useful framework that separates RCT studies into two categories: Knowledge-Focused Evaluations and Decision-Focused Evaluations.[48] IDinsight considers the former to be the traditional focus of J-PAL and its affiliates, and categorizes itself into the latter, which deploys methods of rigorous evaluation in bespoke ways, as requested by an organization seeking solutions to particular programmatic problems within particular constraints.[49] In the process, it is willing to sacrifice some internal validity and scientific rigor for gains in time and learning. Other organizations like Dalberg and Yale economist Dean Karlan’s “Goldilocks” project, which helps nonprofits build best-fit monitoring and evaluation (M&E) systems, also fall into this second category, and might be referred to as “RCT practitioners.”[50] J-PAL and IPA, the traditional “RCT purists,” are best positioned to continue expanding and refining the knowledge base, rather than diluting their resources to reach for policy relevance.

J-PAL, IPA, 3ie, and the World Bank are taking steps to generate policy briefs and place them into the right hands.[51] They are partaking in “matchmaking” between researchers and policymakers, and adding to the replications literature to consolidate learning.[52] These are all steps in the right direction, but well-written policy briefs alone are not sufficient. RCT purists need to get better at synthesizing and disseminating the causal chain of events within studies, not only trawl for patterns across outcomes. They should deploy effective communicators (program managers, media contacts, journalists) to move the dialogue from “more data” and “RCTs are good” to promoting a better understanding of the data that the movement has amassed.

RCT purists also need to strengthen ties with thinkers and planners within governments and implementing agencies, and not only when it is helpful for a new study. The Kenya IPA office is a good example of an organization making this transition. Though slow and sometimes difficult to coordinate, strengthening ties to government is a step in the right direction. In 2014, IPA attended a slew of education and health conferences to share RCT findings.[53] The organization now has a seat at breakfast meetings with the Ministries of Education and Health, and an agreement with the Ministry of Education to help update Kenya’s school curriculum.[54] The results are promising, if not always recognized. “There’s an attribution problem,” says Francis Meyo, the IPA Kenya Policy Coordinator.[55] Government offices take the lessons they circulate and sometimes use them, but don’t always connect an idea to an RCT. This is precisely the role that purists should be striving for: presence, educated influence, and agnosticism about cookie-cutter results from past RCTs.

The Real Work Begins

While a generation of RCT studies has not resulted in the tidy, generalizable outcomes that Duflo hoped for, the studies have produced many detailed individual portraits that can help to flesh out the strengths and limitations of particular strategies for poverty alleviation. As J-PAL and its affiliates mature into the next decade of evaluations, they must focus on the dissemination process of such specific pieces of rigorous evidence with an appropriate dose of humility. In this way, randomistas can remain relevant—not primarily as doers and fixers but as advisors and thought-leaders—to the global development agenda. Short of these steps, the RCT revolution may be talking itself into the annals of economic history.

 

Courtney Han has managed randomized control trials around gender, industrial jobs and financial inclusion in sub-Saharan Africa and Southeast Asia. She is a second year candidate in the Master in Public Administration in International Development program at the Harvard Kennedy School, and is interested in agriculture, aquaculture and evidence-based policy design for developing economies.

 


ENDNOTES


[1] “J-PAL: Kenya Evaluations,” accessed 10 January 2016, https://www.povertyactionlab.org/evaluations?f[0]=field_country%3A91.

[2] “J-PAL: Kenya Evaluations.”

[3] Martin Ravallion, “Should the Randomistas Rule?” 2009, http://siteresources.worldbank.org/INTPOVRES/Resources/477227-1142020443961/2311843-1229023430572/Should_the_randomistas_rule.pdf.

[4] AusAID, “3ie and the funding of impact evaluations: a discussion paper for 3ie’s members,” 2011, http://mande.co.uk/blog/wp-content/uploads/2011/12/Discussion-Paper-3ie-and-the-funding-of-impact-eval-FINAL.pdf.

[5] (IEG) Independent Evaluation Group, “World Bank Group impact evaluations: relevance and effectiveness,” World Bank Group, 2012, http://ieg.worldbankgroup.org/Data/reports/impact_eval_report.pdf.

[6] Michael Kremer, “J-Pal@10.” 27 January 2014, https://www.youtube.com/watch?v=YGL6hPgpmDE.

[7] Michael Kremer, “J-Pal@10.”

[8] Michael Kremer, “J-Pal@10.”

[9] Esther Duflo, Michael Kremer, and Jonathan Robinson, “Nudging Farmers to Use Fertilizer: Theory and Experimental Evidence from Kenya,” American Economic Review 101, no. 6 (2011): 2350-90, http://www.nber.org/papers/w15131.pdf.

[10] Michael Kremer and Ted Miguel, “Worms: Identifying Impacts On Education And Health In The Presence Of Treatment Externalities.” Econometrica 72, no. 1 (2004): 159–217.

[11] Daniel Altman, “Small-Picture Approach to a Big Problem: Poverty,” New York Times. 20 August 2002, http://www.nytimes.com/2002/08/20/business/small-picture-approach-to-a-big-problem-poverty.html?pagewanted=all.

[12] Ian Parker, “The Poverty Lab,” The New Yorker, 17 May 2010, http://www.newyorker.com/magazine/2010/05/17/the-poverty-lab.

[13] Dani Rodrik, “Goodbye Washington consensus, hello Washington confusion? A review of the World Bank’s economic growth in the 1990s: learning from a decade of reform.” Journal of Economic Literature 44, no. 4 (2006): 973-987.

[14] Dambisa Moyo, Dead Aid: Why Aid Is Not Working And How There Is A Better Way For Africa. (Vancouver: Douglas & McIntyre, 2009).

[15] Elsa V. Artadi and Xavier Sala-i-Martin, “The Economic Tragedy of the XXth Century: Growth in Africa,” NBER, no. 9865 (2003), http://www.nber.org/papers/w9865.pdf.

[16] Robert Dixon, The Origin of the Term “Dismal Science” to Describe Economics. Department of Economics, University of Melbourne, 1999. Accessed at http://www.krannert.purdue.edu/faculty/smartin/ioep/dismal.pdf.

[17] Shanta Devarajan, “Can Randomized Control Trials Reduce Poverty,” The World Bank Blog. 23 March 2011. http://blogs.worldbank.org/africacan/can-randomized-control-trials-reduce-poverty.

[18] John Lichfield, “Step aside, Sartre: this is the new face of French intellectualism.” The Independent. 12 January 2009, http://www.independent.co.uk/news/world/europe/step-aside-sartre-this-is-the-new-face-of-french-intellectualism-1332028.html.

[19] Ian Parker, “The Poverty Lab.”

[20] Homepage, https://www.povertyactionlab.org/.

[21] “J-PAL: Kenya Evaluations.”

[22] Neil Buddy Shah, Paul Wang, Andrew Fraker, and Daniel Gastfriend, “Evaluations with impact: decision-focused impact evaluation as a practical policymaking tool,” International Initiative for Impact Evaluation (3ie), no. 25 (2015): 16, http://www.3ieimpact.org/media/filer_public/2015/10/01/wp25-evaluations_with_impact.pdf.

[23] Ruth Levine and William Savedoff, “The Future of Aid: Building Knowledge Collectively,” Center for Global Development. 7 January 2015, http://www.cgdev.org/publication/future-aid-building-knowledge-collectively.

[24] “3ie: Impact Evaluation Repository,” International Initiative for Impact Evaluation, accessed 10 February 2016, http://www.3ieimpact.org/evidence/impact-evaluations/impact-evaluation-repository/.

[25] Annie Duflo and Dean Karlan, “What Data can do to Fight Poverty,” New York Times. 29 January 2016. http://www.nytimes.com/2016/01/31/opinion/sunday/what-data-can-do-to-fight-poverty.html.

[27] Bill Easterly, “Development Experiments: Ethical? Feasible? Useful?” 15 July 2009. http://www.nyudri.org/aidwatcharchive/2009/07/development-experiments-ethical-feasible-useful.

[28] Ricardo Hausmann, Interview by Courtney Han. Personal Interview. [Cambridge, MA], 3 February 2016.

[29] Ricardo Hausmann, Interview by Courtney Han.

[30] Lant Pritchett, “Using ‘Random’ Right: New Insights from IDinsight Team,” Center for Global Development, 10 December 2015, http://www.cgdev.org/blog/using-%E2%80%9Crandom%E2%80%9D-right-new-insights-idinsight-team?utm_source=IDinsight+News&utm_campaign=84e2e34283.

[31] Dani Rodrik, “Doing Development Better,” Project Syndicate, 11 May 2012, https://www.project-syndicate.org/commentary/doing-development-better.

[32] Ian Parker, “The Poverty Lab.”

[33] “Deworm The World,” Evidence Action, accessed 25 January 2016, http://www.evidenceaction.org/#deworm-the-world.

[34] “Safe Water Dispensers,” Evidence Action, accessed 25 January 2016, http://www.evidenceaction.org/#dispensers.

[35] David K. Evans and Anna Popova, “Cash transfers and temptation goods: a review of global evidence,” Policy Research working paper no. WPS 6886; Impact Evaluation series no. IE 127. Washington, DC: World Bank Group. May 1, 2014, http://documents.worldbank.org/curated/en/2014/05/19546774/cash-transfers-temptation-goods-review-global-evidence.

[36] Calculated as the standard deviation divided by average results of studies within a given intervention category.

[37] The coefficient of variation was 1.9 in social science RCTs compared to 0.1 to 0.5 in medical trials.

[38] The R-squared was in very low ranges of 0.04 to 0.16.

[39] Eva Vivalt, “How Much Can We Generalize from Impact Evaluations?” Stanford University, 9 February 2016, http://evavivalt.com/wp-content/uploads/2014/10/Vivalt-JMP-10.28.14.pdf.

[40] “J-PAL 2015: Scale-Ups,” The Abdul Latif Jameel Poverty Action Lab, accessed 10 February 2016, http://www.povertyactionlab.org/scale-ups.

[41] Esther Duflo, Pascaline Dupas, and Michael Kremer, “School Governance, Teacher Incentives, and Pupil Teacher Ratios: Experimental Evidence from Kenyan Primary Schools,” NBER Working Paper 17939, 2012.

[42] Tessa Bold, Mwangi Kimenyi, Germano Mwabu, Alice Ng’ang’a and Justin Sandefur, “Scaling up what works: experimental evidence on external validity in Kenyan education. Working Paper 321.” Center for Global Development. March 12, 2013, http://www.cgdev.org/sites/default/files/Sandefur-et-al-Scaling-Up-What-Works.pdf.

[43] Suleiman Asman, Interview by Courtney Han. Personal Interview. [Nairobi, Kenya], 13 January 2016.

[44] Carolyn Nekesa, Interview by Courtney Han. Phone Interview, 15 January 2016.

[45] Drew B. Cameron, Anjini Mishra and Annette N. Brown, “The growth of impact evaluation for international development: how much have we learned?” Journal of Development Effectiveness, 8:1, 1-21, 14 February 2016, http://www.tandfonline.com/doi/pdf/10.1080/19439342.2015.1034156.

[46] Francis Meyo, Interview by Courtney Han. Personal Interview. [Nairobi, Kenya], 14 January 2016.

[47] Francis Meyo, Interview by Courtney Han.

[48] Shah et al., “Evaluations with impact: decision-focused impact evaluation as a practical policymaking tool,” 16.

[49] Shah et al., “Evaluations with impact: decision-focused impact evaluation as a practical policymaking tool,” 16.

[50] Dean Karlan, “The Goldilocks Project: Helping Organizations Build Right-Fit M&E Systems,” Innovations for Poverty Action, http://www.poverty-action.org/goldilocks.

[51] Shah et al., “Evaluations with impact: decision-focused impact evaluation as a practical policymaking tool,” 27.

[52] Shah et al., “Evaluations with impact: decision-focused impact evaluation as a practical policymaking tool,” 27.

[53] Suleiman Asman, Interview by Courtney Han.

[54] Suleiman Asman, Interview by Courtney Han.

[55] Francis Meyo, Interview by Courtney Han.