Design Considerations for Evaluating the Impact of PEPFAR: Workshop Summary

4 Designing an Impact Evaluation with Robust Methodologies

This chapter summarizes workshop discussions on methodological issues related to impact evaluation design for the President's Emergency Plan for AIDS Relief (PEPFAR) and is divided into three sections. In the first section, a diverse set of case studies of conceptual models and methodological approaches is presented from previous large-scale evaluations by the World Bank, the Abdul Latif Jameel Poverty Action Lab at the Massachusetts Institute of Technology (Poverty Action Lab), the UK Department for International Development (DFID), the Cooperative for Assistance and Relief Everywhere, Inc. (CARE), and The Global Fund to Fight AIDS, Tuberculosis, and Malaria (The Global Fund). In the second section, methodological challenges and opportunities of impact evaluation are described for the measurement of outcomes and impacts specific to human immunodeficiency virus/acquired immunodeficiency syndrome (HIV/AIDS), for the measurement of more general outcomes and impacts, for attribution and accounting, and for the aggregation of impact results. The third section summarizes themes common to the approaches.

CONCEPTUAL MODELS AND METHODOLOGICAL APPROACHES: CASE STUDIES

Impact evaluations require the development of a conceptual model. The model must be defined, the inputs and outcomes measured, and assumptions and conversion factors determined. For prevention of mother-to-child transmission of HIV (PMTCT), noted speaker Sara Pacqué-Margolis of
the Elizabeth Glaser Pediatric AIDS Foundation, there is a clear, logical pathway between access to services, counseling and testing, test results, prophylaxis by women and infants, and aversion of infections. Assumptions and conversion factors to be determined for PMTCT can include questions like the following:

- What regimens are taken, and how effective are they? Are they actually consumed, and when?
- What is the rate of transmission during labor and delivery?
- What is the rate of prevention of infections in HIV-negative women who come in for counseling?
- What is the level of infection transmitted through breast milk?

Speaker Carl Latkin of the Johns Hopkins School of Public Health cautioned that although models of change are needed to guide interventions, sometimes they do not explain findings. Models are practical heuristics but should not be blinders, he noted; we should not let models narrow the way we look at change.

Impact evaluations also require the use of methodological approaches. These can include quantitative, qualitative, and participatory methods and theory-based program logic. Examples of impact evaluation methods, provided by speaker Mary Lyn Field-Nguer of John Snow, Inc., include client satisfaction interviews and surveys, exit interviews, mystery clients, targeted intervention research, focus groups, and key informant interviews.

The following case studies describe experiences from five evaluations of HIV/AIDS assistance programs, conducted by the World Bank, the Poverty Action Lab, DFID, CARE, and The Global Fund. Conceptual models and different evaluation methodologies are described in the context of each study.
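The cascade logic above lends itself to simple arithmetic. The sketch below (all rates are hypothetical placeholders, not figures reported at the workshop) shows how such assumptions and conversion factors combine into an estimate of infant infections averted:

```python
# Illustrative PMTCT "cascade" model. Every rate below is a made-up
# placeholder chosen for illustration only.

def infections_averted(n_pregnancies, hiv_prevalence, test_uptake,
                       prophylaxis_uptake, base_transmission, efficacy):
    """Expected infant infections averted by a PMTCT program."""
    hiv_positive = n_pregnancies * hiv_prevalence
    # Mothers who are tested, learn their status, and take prophylaxis
    on_prophylaxis = hiv_positive * test_uptake * prophylaxis_uptake
    # Without the program each would transmit at the baseline rate;
    # prophylaxis reduces that rate by `efficacy`
    return on_prophylaxis * base_transmission * efficacy

averted = infections_averted(
    n_pregnancies=10_000, hiv_prevalence=0.10, test_uptake=0.80,
    prophylaxis_uptake=0.70, base_transmission=0.30, efficacy=0.50,
)
print(f"Expected infections averted: {averted:.0f}")
```

Each factor in the product is one of the conversion factors listed above, which is why small changes in any single assumption (for example, prophylaxis uptake) propagate directly into the impact estimate.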
World Bank Evaluation of HIV/AIDS Assistance Programs

Workshop speaker Martha Ainsworth, lead economist and coordinator of the Health and Education Evaluation Independent Evaluation Group at the World Bank, described the approach and methodologies used in an independent evaluation of the World Bank's HIV/AIDS assistance programs. The evaluation assessed $2.5 billion of World Bank investments in HIV/AIDS prevention, care, and mitigation programs between 1988 and 2004 in 62 developing countries. Two objectives of the evaluation were defined: (1) to evaluate the development effectiveness (relevance, efficiency, and efficacy) of HIV/AIDS assistance in terms of lending, policy dialogue, and analytic work at the country level relative to the counterfactual, the absence of a Bank program; and (2) to identify lessons to guide future activities.

Ainsworth shared the World Bank's experience in prioritizing what to measure in evaluation. Although the World Bank has a large portfolio of complementary programs in education and agriculture, indicators were narrowed down to only those with direct HIV/AIDS outcomes and impacts. In addition, identifying how lessons from completed assistance were still relevant to new approaches posed a challenge, given that three-quarters of
the HIV/AIDS assistance programs being evaluated were still in progress. In assessing a long-term, ever-changing implementation approach over time, therefore, the World Bank evaluation was designed to select those issues that were common to all projects, such as political commitment, setting strategic priorities, multisectoral responses, the role of the ministry of health, use of nongovernmental organizations (NGOs) in implementation, and monitoring and evaluation (M&E). The World Bank evaluated the projects completed in the past and examined those issues relevant to ongoing projects. Through this approach, the assumptions and design of the ongoing portfolio were analyzed and prospectively evaluated. The World Bank was able to consider design issues and point out where risks had been mitigated and where problems could be addressed through midstream adjustments.

The World Bank evaluation drew on a number of methodological approaches. As Ainsworth noted, the World Bank does not rely exclusively on a single source of information, but rather uses different types of evaluations already occurring in the context of the work, such as midterm reviews, completion reports, and annual reviews. Evaluation methods used include the following:

- Results chain documentation: Inputs, outputs, outcomes, and impacts of government, World Bank, and other donor efforts were gathered.
- Time lines: Documentation of the timing of efforts was collected, although for many activities this type of M&E information is lacking.
- Interviews: Some information was elicited from interviews of stakeholders, other donors, people and staff involved on the ground, and government implementers.
- Desk work: The following were collected and analyzed: literature reviews; archival research; interviews on the time line of the World Bank response; an inventory of analytic work; a portfolio review of the health, education, transport, and social protection sectors; and background papers on national AIDS strategies.
- Surveys: Surveys were conducted of staff members, audiences for analytic work, project task team leaders, and country directors.
- Field work: Project assessments and case studies, chosen to reflect different levels of experience and where interventions worked or did not work, were collected and reviewed. For example, a project in Indonesia, canceled because the World Bank intervention occurred before anyone was visibly ill, was chosen for the evaluation, as was a project in Russia, where only policy dialogue and analytic work were conducted.
Use of Randomized Controlled Trial Methodologies to Evaluate HIV/AIDS Programs

Rachel Glennerster, executive director of the Abdul Latif Jameel Poverty Action Lab at the Massachusetts Institute of Technology, described the application of randomized controlled trial methodology to HIV/AIDS program evaluation. She described the advantages and disadvantages of randomized trial methodologies and then discussed the results from two case studies in which randomized methods were used: an evaluation of an HIV education program in Kenya and an HIV status knowledge program in Malawi.

Advantages and Disadvantages of Randomized Evaluations

To know the true impact of a program, one must be able to assess how the same individual or group would have fared with and without an intervention. Because it is impossible to observe the same individual in the presence and absence of an intervention simultaneously, comparison groups that resemble the test group are commonly used. Common approaches for selecting comparison groups include a "before and after" approach, in which the same group of individuals is compared before and after exposure to an intervention, and a "cross-sectional" approach, in which, at a single point in time, a group of countries or communities in which an intervention has occurred is compared to a "non-intervention" group. However, programs are usually started in particular places at certain times for a reason, and they are usually established with the countries, communities, schools, and individuals most committed to action. Estimates of program impact may therefore be biased, because it is difficult to find a comparison group as committed as those where the program was established. This may in part explain why projects typically work well in a few places but fail when scaled up.
In randomized controlled trials, as in medical clinical trials, those who receive the treatment and those in the control group are selected randomly. By construction, those who receive the proposed new intervention are no more committed, no more motivated, no richer, and no more educated than those in the control group. Randomized trials therefore produce results that are freer from bias than other epidemiological studies. Randomized evaluations can be used to test the efficacy of interventions before they are scaled up to the national level. Randomized trials conventionally have been used to measure drug effectiveness, but they are increasingly being applied in areas where they have not been common. For example, randomized trials can be used to investigate social questions, such as which messages are most effective in changing the sexual behavior of young girls.
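The selection problem described above can be made concrete with a toy simulation (all numbers are hypothetical). Communities with more latent "commitment" both adopt a program more readily and do better regardless, so a naive comparison of adopters to non-adopters overstates the true effect, while random assignment recovers it:

```python
import random
from statistics import mean

random.seed(0)
TRUE_EFFECT = 2.0  # the real program effect in this simulation

def outcome(commitment, treated):
    # Commitment raises outcomes on its own; the program adds TRUE_EFFECT.
    return 10 + 3 * commitment + (TRUE_EFFECT if treated else 0.0) + random.gauss(0, 1)

communities = [random.random() for _ in range(20_000)]  # latent "commitment"

# Self-selected rollout: the most committed communities adopt the program.
treated = [outcome(c, True) for c in communities if c > 0.5]
control = [outcome(c, False) for c in communities if c <= 0.5]
naive_estimate = mean(treated) - mean(control)

# Randomized rollout: adoption is independent of commitment.
coin = [random.random() < 0.5 for _ in communities]
r_treated = [outcome(c, True) for c, t in zip(communities, coin) if t]
r_control = [outcome(c, False) for c, t in zip(communities, coin) if not t]
randomized_estimate = mean(r_treated) - mean(r_control)

print(f"naive: {naive_estimate:.2f}, randomized: {randomized_estimate:.2f}")
```

The naive estimate absorbs the commitment gap between adopters and non-adopters, while the randomized estimate clusters around the true effect of 2.0.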
There is a perception that randomized evaluations are difficult both to implement and to integrate with what is going on at the ground level, but with innovations in randomization over the past 10 years, randomized studies have become less intrusive and less like formal clinical trials. Several mechanisms exist to introduce randomization more naturally into the way a government or an NGO works on the ground, including the following:

- Lottery: Randomization can be introduced through a lottery if a program is oversubscribed.
- Beta testing: Randomization can be introduced through small-scale experimentation with methods before scaling up to the national level.
- Randomized phase-in over time and space: Capacity or financial constraints may limit the ability to introduce interventions in all communities immediately. The order in which a program is phased in can be randomized, allowing an assessment of effectiveness during the phase-in period.
- Encouragement design: Often, national programs that are up and running do not have 100 percent adoption; the impact of such programs can be evaluated by randomly encouraging some people to participate.

Several of these mechanisms simultaneously help to address some of the ethical questions surrounding randomized design, namely the exclusion of people from access to care or programs that might save their lives. In the randomized phase-in approach, all individuals ultimately benefit from the intervention; under the encouragement design, no one is denied care. A disadvantage of randomized evaluation is that it cannot be done after the fact; it must be implemented with the program. Institutional constraints are another disadvantage, sometimes making it more difficult to engage with partners in an intensive way.
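Mechanically, a randomized phase-in reduces to shuffling the rollout order. A minimal sketch (the community names and wave counts are hypothetical, not from any PEPFAR program):

```python
import random

def phase_in_schedule(communities, n_waves, seed=42):
    """Randomly assign communities to rollout waves.

    While wave k is being rolled out, communities assigned to later waves
    form a natural comparison group that will still receive the program."""
    order = list(communities)
    random.Random(seed).shuffle(order)       # the randomization step
    wave_size = -(-len(order) // n_waves)    # ceiling division
    return [order[i:i + wave_size] for i in range(0, len(order), wave_size)]

# Hypothetical community names, purely for illustration.
waves = phase_in_schedule([f"community_{i}" for i in range(10)], n_waves=3)
for k, wave in enumerate(waves, start=1):
    print(f"wave {k}: {wave}")
```

Fixing the seed makes the assignment auditable, which matters when implementing partners need to verify that the rollout order really was random.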
One workshop participant noted that randomized controlled trials can be difficult to translate from the individual level to the community level, where interventions are more complex. Glennerster acknowledged that randomized controlled trials can be improperly designed and can thereby generate incorrect results.

Using Randomized Trials to Evaluate HIV/AIDS Education Programs in Kenya

Randomized trial methodology was used to evaluate a Kenyan HIV/AIDS education project, a collaborative effort among the government of Kenya, a local NGO, U.S. universities, and Jomo Kenyatta University in
Kenya. The method was used in randomly chosen schools to test a range of education strategies for their effectiveness in getting children to understand messages about the risks of HIV. These strategies included the following:

- training teachers in a new HIV/AIDS education curriculum;
- reducing education costs to encourage young girls to stay in school;
- holding debates about whether or not to teach about condoms in primary schools;
- holding essay competitions about protection from HIV; and
- telling children about relative infection rates by age, including the dangers of sexual, gift-exchanging relationships between young girls and older men ("sugar daddies"), the greater likelihood of older men than younger men to be infected, and the greater likelihood of girls than boys to be infected.

Upon implementation of each program, the evaluation tracked observed changes in behavior, including school dropout rates, marriage, pregnancy, and childbirth, as determined by community interviews. Follow-up studies are also tracking HIV infection rates under each type of intervention. Results from the trial are shown in Figure 4-1.

FIGURE 4-1 Impacts of alternative HIV/AIDS education strategies on girls' behavioral outcomes.
NOTE: Indicates that the difference with the comparison group is significant at 10 percent.
SOURCES: Duflo et al., 2006, and J-PAL, 2007.
The teacher training in the national curriculum had little effect on school dropout rates, marriage, and childbirth, although girls from schools where the training was conducted were more likely to be married if they had a child, and the training slightly increased tolerance of those with HIV. Reducing the cost of education was found to be an effective strategy for reducing dropout, marriage, and childbirth rates. Education programs about the dangers of sexual relations with older men, or sugar daddies, led to a 65 percent drop in pregnancies or childbirths with older men and no increase in childbearing with younger men. Self-reported data indicated a shift from relationships with older men to relationships with younger men. Self-reported data from the boys in the group indicated increased condom use, potentially because boys had learned that girls were much more likely to be infected than boys. Results of the debate and essay interventions remain to be tested with outcome data; currently, only self-reported data exist, which can be very biased.

On the basis of the costs of the interventions, the evaluators were able to calculate a cost-per-childbirth-averted rate for each intervention. The education program about older men was the most cost-effective intervention, at $91 per childbirth averted, compared to $750 per childbirth averted for interventions to reduce the cost of schooling.

Using Randomized Trials to Evaluate HIV Status Knowledge Programs in Malawi

Although half of HIV/AIDS prevention spending in Africa focuses on HIV testing, many of those tested do not come back to pick up their results. A study conducted in Malawi used randomized evaluation to test the impact of campaigns promoting knowledge of HIV status (Thornton, 2007).
Only 40 percent of those tested for HIV returned to collect their results, but the study showed that a small incentive (10–20 cents, a small fraction of the daily wage) was enough to increase results collection by 50 percent. The study went on to test whether knowledge of status changed behavior. In follow-up interviews with those who had and had not received encouragement to pick up their test results, people were given the opportunity to buy subsidized condoms and the money to buy them. Comparing the treatment group (those encouraged to collect results and therefore more likely to know their status) with the control group (those not encouraged and thus less likely to know their status), the study found that knowledge of HIV status had virtually no impact on whether people purchased subsidized condoms, even when they were given the money to buy them. Only HIV-positive individuals in long-term partnerships were more likely to buy condoms if they knew their status, and even they bought few subsidized condoms.
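Comparisons like these, between an encouraged (treatment) group and a non-encouraged (control) group, often come down to testing a difference between two proportions. A minimal sketch, with made-up counts chosen only to echo the reported 40 percent baseline collection rate:

```python
from math import erfc, sqrt

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two proportions.

    x1/n1: successes and sample size in the control group
    x2/n2: successes and sample size in the treatment group"""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    return z, erfc(abs(z) / sqrt(2))  # two-sided p from the normal tail

# Made-up counts: 40% of the no-incentive group collect results,
# versus 60% of the small-incentive group.
z, p = two_proportion_ztest(x1=200, n1=500, x2=300, n2=500)
print(f"z = {z:.2f}, two-sided p = {p:.2g}")
```

With a few hundred participants per arm, even a modest difference in collection rates is detectable, which is one reason encouragement designs are feasible at realistic sample sizes.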
Glennerster cautioned that if randomized methodologies are not used and if studies survey only the sample that returns for test results, it may appear as if knowledge of status is effective in reducing HIV incidence. A randomized methodology allows researchers to tease out proper attribution for the perceived success of a program. Glennerster also noted that the use of plausible correlation approaches, suggested by workshop speaker Paul De Lay of the Joint United Nations Programme on HIV/AIDS (UNAIDS) as a more practical methodology applicable to work at the country level, without doing a full trial can also lead to the wrong policy conclusion. With millions of dollars being invested in knowledge-of-HIV-status programs, it is worth testing whether they are effective in reducing incidence, she concluded.

DFID Evaluation of the National HIV/AIDS Strategy

Speaker Julia Compton, senior evaluation manager of the Evaluation Department, DFID, described a recent evaluation of the UK national HIV/AIDS strategy, "Taking Action," a comprehensive and far-reaching $3 billion, 5-year effort launched in 2004, which included a substantial overseas investment component. This national strategy cuts across the UK government and involves six priority areas. The following four objectives were defined for the evaluation:

- developing recommendations for improving implementation;
- developing recommendations for how to measure success, including indicators;
- developing recommendations for a future UK strategy on HIV and AIDS; and
- developing recommendations for other UK government strategies.

Through an extensive consultative process, DFID identified 13 evaluation questions focusing on inputs and processes specific to decisions, for example, the usefulness of spending targets and the effectiveness of country-led approaches. The evaluation used several methodologies.
Seven case studies of countries were conducted, and three working papers were developed to gain an understanding of spending, M&E frameworks, and challenges in reaching women, young people, and vulnerable groups. The evaluation was a heavily consultative process; in fact, the process of communications and consultations during the evaluation may have had a greater impact on changes in the strategy than the actual evaluation data, remarked Compton. The process of evaluation motivated DFID to make the changes needed to achieve positive results. Compton cautioned that concentrating too narrowly on the data, at the expense of communication and understanding what policy makers want, may result in missed lessons from evaluation.

A major challenge to the DFID evaluation was the declining quantity and quality of data collected at projects in-country. Because DFID relies heavily on country-led approaches and country systems to collect data, this was a major constraint to the evaluation.

CARE Evaluation of Women's Empowerment Programs

Kent Glenzer, director of the Impact Measurement and Learning Team at CARE, described the approach and methodology of a multiyear evaluation of the impact of women's empowerment interventions. The evaluation is a $500,000 effort assessing interventions at field sites in more than 40 countries, plus 900 other projects through secondary data. This evaluation is being conducted to inform organizational change at CARE, a private, international humanitarian organization with a focus on fighting global poverty.

CARE uses a literature-based theory of social change and defines the concept of empowerment as a process of change in women's agency, social structures, and relations of power through which women negotiate claims and rights. CARE's approach for evaluating complex systems, such as women's empowerment, involves bringing together experts (internal, external, and local) and coupling M&E with project implementation. In CARE's experience, local actors know and understand systemic changes better than external experts; therefore, CARE's role is to bring actors, most importantly women and girls, together over the long term to discuss systems changes, develop hypotheses, and build collective knowledge about change. CARE is tracking change across 23 categories of women's empowerment.
Indicators, including some developed by local men and women, are developed at multiple levels for each category and include measures of individual skills or capabilities; measures of structures such as laws, family and kin practices, institutions, and ideologies; and measures of relational dynamics, such as those between men and women and between the powerful and less powerful. Although the indicators differ across sites, broad patterns can be compared relating to where and how change is happening.

The following attributes of a successful evaluation approach, from the perspective of CARE, were outlined:

- Evaluation is a long-term learning experience that should unite relevant actors.
- Evaluation should be flexible enough that different dependent variables can be specified in different contexts, but should be designed to permit comparison of variables across contexts.
- Centrally planned, mixed-method evaluation designs work best.

The Global Fund Evaluation

Stefano Bertozzi, a member of the Technical Evaluation Reference Group of The Global Fund, described a 5-year evaluation plan for The Global Fund, which will focus on 8 countries in depth, plus 12 others using secondary information. The evaluation is a "dose-response design," meaning it will look for correlations between the intensity of project implementation and changes in trends of the HIV/AIDS epidemic, in terms of both survival of infected individuals and prevention of new infections. The plan includes evaluation of the following three major topics:

- Organizational efficiency: The operations, business model, and governance structure of The Global Fund, which are based on technical reviews of country-generated proposals with little country presence other than auditing firms, will be evaluated.
- Partnership environment effectiveness: Country and grant performance will be evaluated, including the effectiveness of mobilization of technical assistance and of country-coordinating mechanisms.
- Health impact: The health impact of The Global Fund on the three diseases it covers (HIV/AIDS, TB, and malaria) will be evaluated.

MACRO International Inc., Harvard University, the World Health Organization (WHO), and Johns Hopkins University are implementing the evaluation, and data collected by MACRO through Demographic and Health Surveys-Plus (DHS+)1 will serve as the baseline assessment. The limited budget of the evaluation will not permit the conduct of large-scale surveys.
METHODOLOGICAL CHALLENGES AND OPPORTUNITIES IN EVALUATING IMPACT

Workshop participants described methodological challenges and opportunities in evaluating the impact of PEPFAR, including those in measuring outcomes and impacts specific to HIV/AIDS, measuring broader impacts and outcomes, attributing results, and aggregating the results of impact evaluation. The discussions were wide-ranging and touched on many challenges and opportunities, but were by no means an exhaustive or prioritized list of considerations or an in-depth analysis of any one of them.

1 Demographic and Health Surveys including HIV prevalence measurement are known as "DHS+."

Measuring HIV/AIDS-Specific Outcomes and Impacts

HIV/AIDS-specific outcomes and impacts include the measurement of HIV prevalence, incidence, infections averted, mortality rates, development of drug resistance, orphanhood prevention, behavioral change, and stigma and discrimination. Workshop participants described methodological challenges and opportunities in each of these areas.

Measuring Change in HIV Prevalence

HIV prevalence is the proportion of individuals within a population infected by HIV during a particular time period. It is a function of both the death rate of those already infected and the rate at which new infections occur. Repeated surveillance of pregnant women at antenatal clinic (ANC) sentinel sites is currently the most common method for measuring changes in HIV prevalence.

Workshop speaker Theresa Diaz of the U.S. Centers for Disease Control and Prevention (CDC) pointed out some of the challenges and limitations of this approach. Comparison with nationally representative household-based surveys shows that the ANC surveillance method tends to overestimate prevalence, she said, because ANC sentinel sites are predominantly urban. In addition, the ANC methodology does not take into account other factors, such as changes in the use of clinics over time, increased survival, or immigration, which can lead to a change in HIV prevalence. The method is also unreliable for measuring prevalence where epidemics are concentrated in high-risk groups, as in Vietnam. Diaz noted that a number of new tools are now becoming available to analyze prevalence trends more effectively.
CDC uses a suite of methods (chi-square tests for linear trend, linear regression, and nonparametric methods) for analyzing prevalence trends, using only the most consistent ANC sites and the most recent data. In addition, a second population-based survey of HIV testing will soon be available in some countries to allow analysis of HIV prevalence over time. The collection of data on antiretroviral (ARV) use, both from ANC sentinel surveillance surveys and from the population-based surveys, would allow better prevalence data to be collected, in addition to data on coverage. Finally, methods such as respondent-driven sampling are being standardized for collecting HIV sero-prevalence data among high-risk groups. When such methods use the same sampling methodology in the same place over time, trends can be observed.
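One member of such a suite, a chi-square (Cochran-Armitage) test for linear trend in proportions, can be sketched as follows. The sentinel-site counts are invented for illustration and are not real surveillance data:

```python
from math import erfc, sqrt

def prevalence_trend_test(scores, positives, totals):
    """Cochran-Armitage chi-square test for linear trend in proportions.

    scores    -- ordered scores for each survey round (e.g., year index)
    positives -- HIV-positive count at the ANC site in each round
    totals    -- number of women tested in each round
    Returns (z, two_sided_p); negative z indicates a declining trend."""
    n = sum(totals)
    p_bar = sum(positives) / n
    s_xr = sum(x * r for x, r in zip(scores, positives))
    s_xn = sum(x * t for x, t in zip(scores, totals))
    s_x2n = sum(x * x * t for x, t in zip(scores, totals))
    num = s_xr - p_bar * s_xn
    var = p_bar * (1 - p_bar) * (s_x2n - s_xn ** 2 / n)
    z = num / sqrt(var)
    return z, erfc(abs(z) / sqrt(2))

# Illustrative (made-up) sentinel-site data: prevalence falling over rounds.
z, p = prevalence_trend_test(scores=[0, 1, 2, 3],
                             positives=[120, 105, 90, 70],
                             totals=[500, 500, 500, 500])
print(f"z = {z:.2f}, two-sided p = {p:.2g}")
```

A test like this asks specifically whether prevalence moves monotonically with time, which is why it is more sensitive to a steady decline than an omnibus chi-square test across rounds.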
…supplies, lab services, and curative care services. Quantitative provider surveys were used to measure impact on individual providers and facilities receiving funds and to assess training, supervision, motivation, and job satisfaction. In-depth qualitative interviews with important stakeholders were also conducted throughout the entire health system.

Novak stressed the importance of monitoring both positive and negative impacts of interventions, which can help countries address critical issues in the health system. For example, although the SWEF evaluation results showed positive impacts on the health system, such as greater participatory engagement, decentralization, the emergence of new public–private collaborative arrangements, creation of improved incentives and work environments for those working in HIV/AIDS, and harmonization of pricing and cost-recovery approaches, there were also some negative impacts, such as delivery-level constraints as HIV/AIDS drew both human resources and services away from other health areas, and poorly functioning procurement and distribution systems in some countries. Challenges of this more descriptive methodological approach include the lack of empirical estimates of impacts, small sample sizes, the short time interval over which change was evaluated, and limited ability to attribute impact.

Evaluating impact of HIV/AIDS interventions on non-HIV primary health care services. Jessica Price, Rwanda country director of Family Health International (FHI), presented results from a study conducted in Rwanda testing the hypothesis that HIV/AIDS interventions strengthened non-HIV primary health care services. Study data were derived from a review of the monthly activity reports submitted by health centers to the government of Rwanda.
The study compared the quantity of non-HIV health services delivered before and after the introduction of basic HIV care, defined as services including counseling and testing, PMTCT, preventive therapy, and basic upgrades to health center infrastructure. The study assessed 30 FHI partner health centers from 4 provinces and 14 districts in Rwanda, comprising 21 faith-based centers and 9 public centers. Hospitals that do not deliver some non-HIV services and health facilities with fewer than 6 months' experience delivering basic HIV care were excluded from the study. A set of 88 indicators of non-HIV service delivery was tracked, with 22 indicators considered to represent the best range of public health services. These included general services (such as inpatient and outpatient services and lab tests), reproductive health services, and services for children. In addition to monitoring impacts of HIV/AIDS interventions, the study also tracked the impacts of two other health programs, primary health care insurance and performance-based financing, and used regression analysis
to isolate the independent effects of the HIV/AIDS interventions. The analysis consisted of calculating mean quantities of non-HIV services delivered per primary health center per month in the two time periods, testing for significant differences, and conducting regression analysis to control for experience with the other health programs (insurance and performance-based financing) to determine which program, if any, had an independent effect on the observed change.

The HIV programs were shown to have had an independent effect on a number of indicators across a range of areas. These areas included improved coverage for antenatal visits and services, use of health care facilities for maternity services by HIV-positive women, syphilis screening, family planning services, child vaccination and growth-monitoring services, outpatient consultations, and hospitalization services.

Limitations and challenges of the methodology were discussed. In future analyses, evaluation of the impacts of HIV programs should also include hospital settings. Indicators could also be tracked for impacts on other diseases (such as malaria, TB, and sexually transmitted infections), quality of patient care, costs of HIV-specific services (such as HIV tests) versus non-HIV-specific services (such as infrastructure upgrades like incinerator construction and maintenance of electricity), and client and provider satisfaction. Future studies should also look at larger sample sizes over longer time periods.

A random selection of sites should also be considered in future studies, noted speaker Field-Nguer. The fact that all chosen sites were FHI partners may have given them a competitive edge, she noted.
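The regression logic described here, estimating each program's independent effect on the change in service volume while controlling for the others, can be sketched with simulated data. This is an illustrative ordinary-least-squares toy with made-up effect sizes, not FHI's actual model:

```python
import random

random.seed(1)

def solve(A, b):
    """Gauss-Jordan elimination for small linear systems."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """Ordinary least squares via the normal equations X'X b = X'y."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

# Simulated health centers: change in monthly non-HIV service volume,
# driven by an HIV program (true effect +8 visits) and an insurance
# scheme (+3), plus noise. All numbers are hypothetical.
rows, change = [], []
for _ in range(500):
    hiv = random.random() < 0.5  # center introduced basic HIV care
    ins = random.random() < 0.5  # center joined the insurance scheme
    delta = 8 * hiv + 3 * ins + random.gauss(0, 4)
    rows.append([1.0, float(hiv), float(ins)])  # intercept + dummies
    change.append(delta)

intercept, hiv_effect, ins_effect = ols(rows, change)
print(f"independent HIV-program effect: {hiv_effect:.1f} visits/month")
```

Because the two program dummies enter the regression jointly, each coefficient estimates that program's effect holding the other constant, which is exactly the "independent effect" question the Rwanda analysis posed.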
But if FHI status did confer an edge, then perhaps unique attributes of the partnership can tell us something about how to replicate the impact, she noted. Workshop participant Laura Porter of CDC added that future studies will need to ensure that service delivery improvement is a real effect and not just an artifact of data system improvement.

Measuring Impact of Complementary Interventions

As described in Chapter 2, PEPFAR investments include numerous interventions in programs complementary to more narrowly focused HIV services. These so-called wraparound programs include interventions in areas such as malaria, TB, nutrition education, food security, social security, education, child survival, family planning, reproductive health, medical training, health systems, and potable water. Workshop speaker Bertozzi described methodologies from two case studies from Mexico in which such complementary interventions were
evaluated: a human-capacity development program for children and a food assistance program. The Oportunidades program is a Mexican government-sponsored human-capacity development program for Mexico's poorest children. The program offers parents financial incentives for ensuring their children's participation in health, nutrition, and educational services. The Programa de Apoyo Alimentario (PAL) provided food assistance (either food or cash payments) to small rural communities in Mexico. Impact evaluations of both Oportunidades and PAL were conducted using prospective randomized evaluation, in which later program enrollees were compared with earlier program enrollees. Both health impacts and education impacts were monitored through the evaluations. For Oportunidades, health indicators tracked included use of preventive services (such as well visits and vaccinations), use of curative services, out-of-pocket expenditures, and anemia prevalence. PAL health impacts monitored included height-for-age, weight-for-height, and weight-for-age. Education indicators monitored in the Oportunidades program included grade-level achievement, attendance, early enrollment, and repetition of grades. The evaluative approach from these studies could potentially be applied to the evaluation of complementary interventions in the PEPFAR program, particularly to health and educational interventions targeting orphans and vulnerable children, noted Bertozzi. Other indicators for "basic capability" child care interventions could include zinc status, sick days, days incapacitated, prevalence of risky and healthy behaviors (such as alcohol use, sexual activity, and exercise), and educational performance. Bertozzi emphasized the importance of controlling for secular (long-term, noncyclical) trends in impact evaluation. Such trends can sometimes have a large effect independent of the intervention.
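The logic of a phase-in design of this kind, in which later enrollees who have not yet been treated serve as the comparison group for earlier enrollees, can be sketched with a small numerical example. All values below are invented for illustration only; they are not data from Oportunidades or PAL.

```python
# Sketch of a phase-in comparison: later enrollees, not yet treated,
# act as the comparison group for earlier enrollees, which allows a
# secular trend to be netted out. All numbers are invented.

# Hypothetical mean height-for-age z-scores at enrollment and follow-up.
early = {"baseline": -1.60, "followup": -1.10}  # enrolled at program start
late = {"baseline": -1.62, "followup": -1.42}   # enrolled after a delay

# A naive before/after change for the early group mixes the program
# effect with any secular trend.
naive_change = early["followup"] - early["baseline"]

# The late (not-yet-treated) group's change over the same period
# estimates the secular trend alone.
secular_trend = late["followup"] - late["baseline"]

# Differencing the two changes nets out the secular trend.
program_effect = naive_change - secular_trend

print(round(naive_change, 2))    # 0.5  (overstates the program)
print(round(secular_trend, 2))   # 0.2  (background improvement)
print(round(program_effect, 2))  # 0.3  (trend-adjusted estimate)
```

The point of the sketch is that the stronger the background trend, the more a simple before/after comparison overstates (or understates) what the program itself achieved.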
For example, malnutrition indicators were tracked in the poorest rural communities in Mexico in the 5 years leading up to the start of the PAL program (ENN-1999 versus PAL-2004, the baseline for the PAL intervention). In the absence of any intervention, noted Bertozzi, extraordinary secular trends led to a halving of malnutrition indicators in these communities. Any intervention conducted during this 5-year period would have given the appearance of stimulating a large positive effect when there might have been none at all, or perhaps even a negative effect.

Measuring Impacts of Gender-Focused Activities

Workshop participants discussed some of the challenges and opportunities of evaluating the impacts of gender-focused activities, including interventions to promote gender equality and women's empowerment. Noting that gender equality and women's empowerment are multidimensional, open, complex, nonlinear, and adaptive systems, speaker Glenzer observed that it is seldom clear what variables are or are not involved. It is a challenge to define what constitutes success and what it looks like on the ground. Glenzer said some of the difficulty in tracking change in gender systems relates to the following characteristics: the large-scale effects of small changes over time, the separation of causes and effects over large spatial and temporal scales, the multiple levels at which change may occur, and the heterogeneity of systems. Speaker Julie Pulerwitz of the Population Council acknowledged the difficulty of implementing rigorously designed evaluations and called for more consensus building about how to operationalize the concept of gender and how to evaluate gender-related activities. Although gender is generally recognized as important, she added, there have been few outcome evaluations and few tools developed to assess how gender-focused activities affect HIV risk. Few good indicators exist that are useful in understanding social dynamics, and evaluation schemes often underrepresent the perspectives of local people, who are a source of such knowledge, noted Glenzer. Speaker Pulerwitz described a new method now available for studying the impacts of gender-focused activities and how those impacts can contribute to PEPFAR goals. Pulerwitz directs an operations research program at the Population Council, called Horizons, that has conducted studies using this method. Pulerwitz shared the study design and tools used for an evaluation of gender-focused programs (group education, community-based behavioral change communication campaigns, and clinical activities) focused on young men in Brazil.
A combination of data collection approaches was used, including the following:

- Pre- and postintervention surveys and a 6-month follow-up survey for three groups of young men, followed over a year: two intervention groups and a comparison group, which eventually also received the interventions after a time delay
- In-depth interviews with a subsample of the young men and their sexual partners
- Costing analysis and monitoring forms for the different activities

An evaluation tool called the Gender-Equitable Men (GEM) scale was used to examine gender norm attitudes and how they changed over time (Barker, 2000; Pulerwitz and Barker, 2008). The scale includes 24 items covering areas such as home and child care, sexual relationships, health and disease prevention, violence, homophobia, and relations with other men. Certain GEM scale domains are associated with partner violence, level of education, and contraceptive use. The GEM tool was used to detect significant changes in attitudes toward equitable gender norms and
in support of inequitable gender norms in the two intervention groups as compared with the comparison group. HIV-related outcomes were also tested; one of the intervention groups showed an increase in condom use with primary partners as compared with the comparison group. The study also looked at covariance between changes in attitudes toward norms and changes in condom use; men who were more gender equitable were more likely to report condom use. The in-depth interview component of the analysis unearthed other changes among those in the intervention groups, including a delay in sexual activity in new relationships. The evidence generated by the evaluation supports interventions that target gender dynamics and their influence on HIV risk behavior in Brazil, concluded Pulerwitz. She noted that there are ongoing or planned efforts to adapt the GEM tool to other country contexts (India, Ethiopia, Namibia, Uganda, and Tanzania) and to other demographic groups, such as married men. Preliminary findings show that results can be highly country specific. Although a similar trend toward more equitable attitudes has been observed in the work conducted in India, baseline attitudes in that country are much less supportive of equitable gender norms than those in Brazil.

Measuring Coordination and Harmonization

Workshop speaker De Lay spoke of a new opportunity for measuring coordination and harmonization, that is, the alignment of interventions with country-level plans and the coordination of efforts among implementing partners. A new tool developed by UNAIDS and the World Bank, the Country Harmonization and Alignment Tool (CHAT), is now available and could be applied to standardizing the assessment of such alignment and coordination (UNAIDS, 2007a).
The tool has been used to assess harmonization and alignment of the national plan, coordinating mechanism, and M&E plan in six pilot countries, and a launch of the tool is planned in two more countries. The tool has revealed that many national plans are still not credible, not costed appropriately, not prioritized, and not actionable. In addition, the tool has shown that few countries have a central funding channel or single procurement system for the HIV/AIDS response. The tool has also shown that “basket funding,” or joint funding by multiple donors, is not normally used. Although donors support the notion of the development of indigenous national M&E capacity, the tool has revealed that in practice donors usually rely on their own M&E systems to collect urgent data when needed.
Measuring Community-Level or Population-Level Service Delivery

Workshop speakers spoke of the challenges of scaling up successful service-delivery interventions for specific populations, such as children, families, communities, and high-risk groups. As workshop speaker Bertozzi observed, it is sometimes difficult to distinguish between a community-level or population-level effect and the effect of an intervention. Tools are needed, noted speakers Kathy Marconi of OGAC and Stoneburner, to measure the effectiveness of interventions in specific populations, including communities, diverse populations, and at-risk or infected populations. Speaker Field-Nguer announced that a new and important addition to the evaluation toolbox is now available: community-level program information reporting (CLPIR) systems (personal communication, R. Yokoyama, John Snow, Inc., January 18, 2008). CLPIR indicators look strictly at community-level service delivery and help answer questions such as when, how, and where people want testing and treatment.

Attributing Impact

Given the diversity of programs and funders, attributing impact, that is, relating a particular effect to the work of a specified agent, is a substantial methodological challenge in evaluation, workshop participants said. The World Bank experience shows that because loans or grants are made to governments, speaker Ainsworth said, the performance of activities depends heavily on governments, and it is therefore difficult to disentangle the efforts of the government and any particular donor from the efforts of all other donors. Even within the programs of a single donor, noted speaker Gootnick, accounting can be complex. Some interventions can be double counted; for example, voluntary counseling and testing is included under both the prevention and care modalities.
As PEPFAR moves increasingly toward more harmonized approaches, noted speaker Compton, it will be even more difficult to disentangle effects in an exclusive way. Many workshop participants agreed that the demand by donors for exclusive attribution may not be constructive. General evaluation of what is and is not working, in contrast, may be desirable, noted workshop moderator Ruth Levine of the Center for Global Development. Speaker Glennerster emphasized that it is preferable to test what works in very specific areas and then judge a program by whether it spends money on interventions whose effectiveness is supported by evidence. All programs are doing many things in-country; they are implementing many different policies. If we want to be effective in focusing resources on what works, we need to identify which interventions have the most impact and which are most cost-effective, she said. Speaker Diaz reinforced this idea, stating that a worthwhile attribution goal should be to know the effectiveness of certain programs and their coverage in terms of impact measures. A useful attribution exercise, she suggested, might be to determine what level of ART coverage decreases general mortality and what types of prevention activities, in which populations, decrease HIV incidence. Ainsworth added that it is nevertheless useful to analyze the value added by the unique approaches of particular donors. An important dimension of attribution is the concept of the counterfactual, an assessment of what would have happened had the donor not intervened. Some speakers noted that the absence of the donor does not necessarily imply that nothing would have happened. Discussant Jim Sherry of George Washington University observed that one consequence of donor interventions is that the donor occupies a particular space and prevents other organizations from filling it. As speaker Bertozzi pointed out, in the case of South Africa, even if outside institutions had not intervened, given the massive social mobilization potential in the country, dramatic change could have been effected without outside help.

Aggregating Evaluation Results

Several speakers noted that the synthesis or aggregation of evaluation results is a methodological frontier. Workshop participant David Dornisch of the U.S. Government Accountability Office proposed that meta-analysis or synthesis could be used to bring together the results of multiple studies. From the congressional perspective, workshop participant Naomi Seiler of the U.S. House of Representatives Oversight Committee also stated that while prospective evaluation is useful, any type of meta-analysis or synthesis of what is already known about types of interventions, contexts, and populations would be helpful.
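One common way to bring together results from multiple studies quantitatively is inverse-variance (fixed-effect) pooling, in which each study's estimate is weighted by its precision. The sketch below illustrates the arithmetic; the study names, effect estimates, and standard errors are invented for illustration and do not come from the workshop.

```python
import math

# Inverse-variance (fixed-effect) pooling of study-level effect
# estimates. All studies, effects, and standard errors are invented.
studies = [
    {"name": "study_A", "effect": 0.30, "se": 0.10},
    {"name": "study_B", "effect": 0.10, "se": 0.05},
    {"name": "study_C", "effect": 0.20, "se": 0.08},
]

# Each study is weighted by its precision (1 / variance), so tighter
# estimates count for more in the pooled result.
weights = [1.0 / s["se"] ** 2 for s in studies]
pooled = sum(w * s["effect"] for w, s in zip(weights, studies)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))

print(round(pooled, 3))     # 0.154
print(round(pooled_se, 3))  # 0.039
```

A fixed-effect model of this kind assumes the studies estimate a single common effect; when contexts differ substantially, as several speakers noted they do, a random-effects approach that models between-study variation would be the more defensible choice.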
Discussant Jimmy Kolker of OGAC echoed the need for data synthesis to be relevant to designing or implementing a program. Workshop discussant Sherry observed that such methods have yet to be developed, however. Sherry predicted that the clustering of country-level assessments and evaluations will likely provide much more information through meta-analysis than one definitive, globally executed impact study. Although there is room for both kinds of evaluations, he noted, there is substantial room for improvement in meta-analysis to look statistically across the results of these studies. Sherry observed that there may be inadequate separation of macro-, micro-, and meta-level evaluation processes, leading to evaluations that either do not make sense to policy makers or are not rigorous enough for scientists. Micro-level evaluation tends to be too technical and too situation specific to be digestible by institutions or useful for interventions. Macro-level evaluation tends to be too soft and too subject to evaluation spin to be digestible or credible. Durable findings are needed
about programs that allow for more sustainable dialogue and learning at the meta-level of evaluation. Another workshop participant raised a question about the value of performing multiple evaluations. Speaker De Lay commented that although it is sometimes desirable to avoid duplication where it is not needed, sometimes duplication is necessary and multiple perspectives are desirable. For example, validation of existing data by an independent group is often a useful alternative to redoing an entire study.

THEMES COMMON TO EVALUATION METHODOLOGIES AND APPROACHES

This section distills some of the main messages and themes common to the discussions about evaluation methodologies and approaches.

Prioritization

Most evaluations require some type of prioritization to narrow down what is to be measured. Speaker Ainsworth noted that for long-term evaluations, for example, one might select only those issues common to all projects. For a large portfolio of activities, she added, one might select a more narrowly defined set of indicators.

Value of Consultation and Communication

Several speakers emphasized the value of consultation and communication in any evaluation approach. Speakers Compton and Glenzer observed that consultation and communication throughout the evaluation process are as important in effecting change and course corrections as the data from the evaluation results. It also matters who is consulted, observed speaker Field-Nguer.

Value of a "Learning" Evaluation

Many of the evaluation methodologies described were formative, or "learning," evaluations, designed to help improve institutional performance. As Glenzer noted, evaluation is a long-term learning experience that should unite relevant actors. Speaker Ainsworth added that bringing to bear the findings of past support can inform ongoing programs.
Using evaluation to understand the variation in outcomes, or the distribution of outcomes within a population, can help us learn, she said. For example, a change in average life expectancy or an average change in behavior is not as
interesting as knowing why behavior changed in one group of people but not another. Others emphasized the heuristic value of negative evaluation results. Analysis of failures, observed speaker Field-Nguer, is sometimes more fruitful than success stories. Negative evaluation results should be divulged and shared, one workshop participant urged; if they are not shared, programs lose credibility and waste money. Speaker Glenzer noted that all of CARE's research reports are published on Emory University's website, including some research indicating that CARE is not having long-term impacts on women's empowerment or on the underlying causes of gender inequality. The emphasis on learning evaluations contrasts with a more typical systemic bias in the international health community, in which actors want to see programs continue, noted workshop discussant Sherry. Therefore, instead of evaluation being used for learning, it is used to protect interests and programs. Sherry underscored the importance of sustaining the institutional learning process. The isolation of evaluation departments in international health systems is one obstacle to institutional learning, he noted; it is analogous to the isolation of smart and reflective people in universities, organized into separate compartments so that they have minimal effect on the society around them. Decision-making cycles, such as 5-year cycles, reauthorizations, or external audits, briefly drive evaluators into prominence, but the attention then fades away. Also observing the existence of different consumers of evaluation, speaker Nils Daulaire of the Global Health Council emphasized the importance of having a single M&E system that satisfies multiple sets of needs.
For example, if a customer for an evaluation is Congress, then the evaluation will emphasize putting on the best possible spin, but that must be balanced with the use of evaluation on a daily basis to help improve program development and results. One step toward achieving a multiuse system is to give evaluators a role in program management and development rather than a peripheral role in projects.

Importance of Designing the Evaluation Early

Several speakers emphasized the importance of considering evaluation design early in the implementation process so that the design will be appropriate and so that impacts can be detected early. Speaker Compton urged that evaluations be set up at the beginning of the process, and speaker Bertozzi also spoke about some of the drawbacks of ex-post evaluation. Speaker Glennerster noted that opportunities exist to use powerful randomization approaches, but they can be used only if the design is included at the beginning of an intervention. Field-Nguer and Bertozzi stressed the importance of baseline assessments, without which the wrong conclusions may sometimes be drawn.
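The role a baseline plays can be made concrete with a small numerical sketch. All figures below are invented for illustration: when program sites start out ahead of comparison sites, an endline-only comparison credits the pre-existing gap to the program, while a comparison of changes from baseline does not.

```python
# Why a baseline matters; all figures are invented. Suppose program
# sites started out ahead of comparison sites on some coverage
# indicator (in percent).
program = {"baseline": 70.0, "endline": 85.0}
comparison = {"baseline": 60.0, "endline": 72.0}

# Without baseline data, only endline levels can be compared, and the
# pre-existing 10-point gap is wrongly credited to the program.
post_only = program["endline"] - comparison["endline"]

# With baselines, comparing changes removes the pre-existing gap.
change_based = (program["endline"] - program["baseline"]) - (
    comparison["endline"] - comparison["baseline"]
)

print(post_only)     # 13.0
print(change_based)  # 3.0
```

In this invented example, the endline-only estimate (13 points) is more than four times the change-based estimate (3 points), which is the kind of wrong conclusion a missing baseline can produce.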
Understanding the Limitations of Models and Data

Workshop participants acknowledged the limitations of the data and models used in evaluation. Speaker Pacqué-Margolis emphasized that empirical data are often inadequate, lacking, or inaccurate, and speakers Ainsworth and Compton emphasized that poor data quality at the country level is often a serious problem. Speaker Garnett emphasized the existence of data gaps for measuring efficacy in different epidemiological contexts. Age- and sex-specific empirical data are also lacking, noted discussant Fowler. Ainsworth stressed that incentives need to be created to encourage project staff and governments to establish and maintain monitoring efforts. Not all data are of the same quality, participants said. Speaker Glennerster noted that data based on self-reported behavior may have reliability problems. Models are powerful tools that can help in evaluation, but they also have limitations. Speaker Glennerster pointed out that models need to be validated with empirical data, and variables need to be added to them to make them more accurate predictors. Speaker Garnett also observed that models are less reliable predictors once the spread of HIV infection becomes epidemic.

Value of Multiple Methodologies

Several presenters noted the value of using multiple methodological approaches in evaluation. Speakers Compton and Ainsworth cautioned against relying exclusively on one evaluation methodology, and speaker Field-Nguer pointed out that multiple methods may yield richer results than one or two methodologies. Field-Nguer also noted that the lack of a baseline assessment (as was the case in PEPFAR) may increase the importance of using several methodologies, including qualitative measures. Speaker Glenzer reinforced the point with his comment that centrally planned, mixed-method evaluation designs work best.
At the same time, the use of multiple methods should be strategic, noted workshop speaker Glennerster. She noted that organizations currently tend to conduct a confused mix of process/output and impact evaluations in too many places. Instead, she recommended conducting good process evaluations everywhere and a moderate number of high-quality impact evaluations focused on a few key questions.

Value of Randomization

Multiple presenters emphasized the value of randomization tools in the conduct of evaluations. Glennerster pointed out that new methods of randomization are now available that can be integrated into an evaluation with
minimal disruption. In his presentation, Bertozzi also drew on evidence from randomized controlled trials. Speaker Field-Nguer pointed out that nonrandom selection of sites has the potential to limit or weaken a study. Workshop participant De Lay discussed situations in which randomization may be impractical.

Comparison Across Contexts

Several workshop participants stressed the highly contextual nature of change when comparing across contexts. Evaluations that are centrally coordinated to permit comparison of variables across contexts, while allowing some flexibility in indicator design at the local level, are optimal, suggested speaker Glenzer. Interventions that are successful in one country are not necessarily transferable to another country, noted workshop speaker Stoneburner. Examples provided by Stoneburner and speakers Latkin, Garnett, and Pulerwitz supported this statement. In some cases, factors independent of an explicit program intervention can influence change. In other cases, a change in behavior does not always lead to a change in the pattern of the HIV/AIDS epidemic, and changes in the pattern of the epidemic cannot always be translated into a change in behavior. Close engagement of the scientific community in evaluation, urged speaker Latkin, can help in assessing the likelihood of transferability of effective programs to other settings.