
Evaluating AIDS Prevention Programs: Expanded Edition (1991)


6 Randomized and Observational Approaches to Evaluating the Effectiveness of AIDS Prevention Programs

In previous chapters, the panel recommended that randomized controlled experiments be used to evaluate a small number of important and carefully selected AIDS prevention projects. Our reasoning has been that well-executed randomized experiments afford the smallest opportunity for error in assessing the magnitude of project effects, and they provide the most trustworthy basis for inferring causation. Notwithstanding this conclusion, we recognize that this strategy will not always be feasible and that nonrandomized studies may be required in some instances.1 In this chapter the panel reviews a number of observational approaches to the evaluation of the effectiveness of AIDS prevention programs. (In addition, Appendix F presents a background paper for this chapter which provides a detailed treatment of an econometric technique known as selection modeling and its potential uses.)

On January 12-13, 1990, the panel hosted a Conference on Nonexperimental Approaches to Evaluating AIDS Prevention Programs. Fifteen experts from the behavioral sciences, statistics, biostatistics, econometrics, psychometrics, and education joined panelists and federal representatives to discuss the application of quasi-experimentation and modeling to evaluating CDC's three major AIDS interventions. This chapter is an outgrowth of those discussions and the papers presented at that conference (Bender, 1990; Campbell, 1990; Moffitt, 1990).

1. Observational designs may convey other practical benefits. For example, such studies avoid the ethical debate that may accompany the withholding of treatment in a randomized study, discussed in Chapter 5. Observational studies may also be advantageous when an intervention occurs "naturally" or has saturated a community before randomization can be implemented.

OVERVIEW

Determining the effectiveness of an intervention project requires comparing how a project participant (or group of participants) fares with how that participant or group would have fared under a different intervention or no intervention. Because such direct comparisons are ordinarily not possible,2 researchers have developed a number of ways to construct a comparison group that "looks like" the participants. The objective is to make this group as similar as possible with respect to confounding factors3 that may affect the outcome (other than the fact of the intervention itself). If participants' selection or reason to enter into a study is not independent of the study's outcome variables, however, selection bias is introduced. For example, if individuals who enter a counseling and testing project are more highly motivated to change their risk-associated behaviors than individuals who do not choose to enroll in such programs, a selection bias is present, and the effects of the intervention cannot be estimated by simply comparing outcomes among program participants and nonparticipants. Strategies to evaluate AIDS interventions in such instances require the assumption that the effects of such confounding variables can be estimated and adjusted for. (A variant of this problem can also arise in randomized experiments, where the attrition of respondents from experimental and control groups can introduce an analogous selection bias.)

As explained in Chapter 1, selection bias can generally be controlled by the random assignment of individuals to one group or another. When properly implemented, randomization will, on average, create groups that have similar initial characteristics and thus are free (on average) of selection bias. The chance always remains, of course, that randomized groups are different, but the chance is small and decreases as the sample size increases. Furthermore, standard statistical tools such as confidence intervals can be used in properly randomized experiments to calculate the variability in the effect size associated with the randomization. Thus, well-executed randomized experiments require the fewest assumptions in estimating the effect of an intervention. Notwithstanding this statistical advantage, the panel urges that underlying theory about whom an intervention will affect (and how) be sufficiently compelling to justify mounting a randomized trial and sufficiently explicit about the relationship between independent variables and the outcome to allow the analysis of the experimental data if randomization fails and statistical adjustments are needed.

2. A few important exceptions arise, such as when the same subject can try two diets or two ointments. But if temporal sequence is important or if memory or attitude is at stake, "you can't go back." Similarly, only one of two alternative surgical procedures will be applied in any one patient, etc.

3. That is, variables that (1) influence outcomes and (2) are not equivalently distributed in the treatment and comparison groups.
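To make the selection problem concrete, the simulation below is a minimal sketch; the population parameters, variable names, and effect sizes are invented for illustration and are not estimates from any real program. It contrasts a naive participant/nonparticipant comparison with a randomized comparison when an unobserved trait such as motivation drives both enrollment and behavior change.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                      # hypothetical population size
true_effect = 0.10               # assumed true gain in a risk-reduction score

# Unobserved motivation affects BOTH enrollment and behavior change.
motivation = rng.normal(0.0, 1.0, n)
enrolls = rng.random(n) < 1 / (1 + np.exp(-motivation))  # motivated people enroll more

def outcome(treated):
    # Behavior change driven by motivation, plus the program effect if treated.
    return 0.20 * motivation + true_effect * treated + rng.normal(0.0, 0.5, n)

# (1) Observational world: enrollees receive the program; compare them with
#     non-enrollees.
y_obs = outcome(enrolls.astype(float))
naive = y_obs[enrolls].mean() - y_obs[~enrolls].mean()

# (2) Experimental world: a coin flip decides who receives the program.
assigned = rng.random(n) < 0.5
y_rct = outcome(assigned.astype(float))
randomized = y_rct[assigned].mean() - y_rct[~assigned].mean()

print(f"true effect:           {true_effect:.3f}")
print(f"naive observational:   {naive:.3f}   # inflated by motivation")
print(f"randomized experiment: {randomized:.3f}")
```

Under these particular assumptions the naive contrast substantially overstates the effect, because enrollees were more motivated to begin with, while the randomized contrast recovers it.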

Nonrandomized studies require additional assumptions and/or data to infer causation. This is so because it is seldom safe to assume that individuals who participate in a program and receive its services are similar to those who do not participate and do not receive services. In a few cases, the differences between participants and nonparticipants may be fully explained by observable characteristics (e.g., age, partner status, education, and so on), but often the differences may be too subtle to be observed (e.g., motivation to participate, intention to change behavior, and so on). This is particularly true in the AIDS prevention arena because so little is known about changing sexual and drug use behaviors. Thus, a simple comparison of the risk reduction behavior of participants and nonparticipants in a program can yield a misleading estimate of the true effect of the program.

In addition to randomized experiments, six observational research designs will be discussed in this chapter. These alternatives use a variety of tactics to construct comparison groups.4 For organizational purposes, the panel clusters the six strategies under two umbrellas that differ in their general approach to controlling bias and providing fair comparisons. One approach involves the design of comparability and the other involves post hoc statistical adjustment:5

- Design approaches develop comparability a priori by devising a comparison group on some basis other than randomized assignment. This may be done through:
  - quasi-experiments,
  - natural experiments, and
  - matching.

- Adjustment approaches correct for selection bias a posteriori through model-based data analysis. Such approaches use models of the process underlying selection or participation in an intervention and of the factors influencing the outcome variable(s). Specific methods include:
  - analysis of covariance,
  - structural equation modeling, and
  - selection modeling.

It should be recognized that the panel's distinction between the two approaches is not absolute. Matching, for example, is sometimes done retrospectively as a method of controlling selection bias, and prospective data can be collected for use in modeling. Furthermore, in some sense, all the approaches involve "modeling," at least to the extent that they take account of behavioral theories or models to infer causation. Despite some imprecision, the panel finds the general distinction helpful in thinking about the ways that have been developed to estimate project effects from nonexperimental designs.

4. Throughout this chapter, control groups will refer to the randomly assigned groups that may either have received no treatment or have received an alternative to the experimental treatment. Their nonrandomized counterparts will be referred to as comparison groups.

5. A third type of observational method is the case study, in which evaluators probe individual histories for factors related to outcome variables. Because little comparability is achieved in these studies, the panel will not discuss them, except to note that case studies often yield hypotheses and measures that can eventually be used in the other designs and can yield useful information to help interpret the results of randomized trials that are not optimally implemented.
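As a rough illustration of the simplest adjustment approach, analysis of covariance, the sketch below adjusts a participant/nonparticipant contrast for observed covariates by regression. The data, covariates, and coefficients are hypothetical assumptions made for this example. The adjustment is only as good as the covariate list: an unobserved factor such as motivation remains uncorrected.

```python
import numpy as np

# Hypothetical observational data: outcome y, participation flag d,
# and observed covariates (age, education). Motivation is NOT observed.
rng = np.random.default_rng(1)
n = 5_000
age = rng.normal(30, 8, n)
educ = rng.normal(12, 2, n)
motivation = rng.normal(0, 1, n)                     # unobserved confounder
d = (0.05 * (age - 30) + 0.8 * motivation + rng.normal(0, 1, n)) > 0
y = 0.10 * d + 0.01 * age + 0.02 * educ + 0.25 * motivation + rng.normal(0, 0.5, n)

# Analysis-of-covariance-style adjustment: regress y on d and observed covariates.
X = np.column_stack([np.ones(n), d, age, educ])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"unadjusted difference: {y[d].mean() - y[~d].mean():.3f}")
print(f"covariate-adjusted:    {beta[1]:.3f}   # still biased: motivation omitted")
```

Both estimates overshoot the assumed true effect of 0.10 here, which is precisely the concern raised below: a posteriori adjustment succeeds only insofar as the confounding factors are understood and measured.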

Choosing Among Strategies

Despite the panel's preference for randomization, we realize that it is not always feasible or appropriate and that its implementation is not immune to compromise. When randomization is infeasible or threats to randomization loom large, we believe researchers should look to alternative strategies and determine which of them, in turn, will be feasible, appropriate, and produce convincing results.

The choice of approach can present itself in many ways: a cohort study, for example, can begin to study all new entrants at a clinic (the a priori approach), or it can look back at all entrants who first appeared at that clinic at some time in the previous months, if the clinic records are good enough.6 Because the data are already in hand, the a posteriori approach may permit an apparently faster investigation of the problem. Offsetting this advantage, however, and often outweighing it, are two other considerations. First, retrospectively collected data may not include measures of key variables, and the available data may be difficult to interpret. Planned data collections may profit from steps taken to ensure the availability of information on all variables of interest. Second, design imposes control over the observations of behavioral and psychological variables: planning what data to collect, how to collect them, and what comparisons to make improves the prospects of obtaining valid and reliable measures of project inputs and outcomes.7 For these reasons, the panel believes that strategies that build on data collected for the specific purpose of evaluating project effects should, in general, have a greater likelihood of success.

In the case of AIDS interventions in particular, the panel is pessimistic about our ability to correct for bias after the fact because we have at present a poorly developed understanding both of the factors affecting participation in such projects and of the factors that induce people to change their sexual and drug use behaviors (e.g., motivation, social support, and so on). Success through a posteriori approaches benefits from a comprehensive understanding of these confounding factors and reliable measurements of them.8

Finally, the panel notes that the charged climate that surrounds many AIDS prevention programs can render decision making difficult in the best of circumstances. Research procedures that produce findings that are subject to considerable uncertainties or that provoke lengthy debates among scientists about the suitability of particular analytic models may, in the opinion of this panel, impede crucial decision making about the allocation of resources for effective AIDS prevention programs. These factors underlie the panel's preference for well-executed randomized experiments, where such experiments are feasible. This is not to say that observational strategies do not have a place in AIDS evaluation designs nor that their role must necessarily remain secondary in the future. Rather, it reflects the panel's judgment that, in the current state of our understanding of AIDS prevention efforts and the state of development of alternative observational strategies for evaluation, overreliance on observational strategies would not be a prudent research strategy where it is feasible to conduct well-executed randomized experiments.

Before reviewing observational strategies for evaluation, the panel provides a brief reprise of the basis for its recommendation that carefully executed randomized experiments be conducted to assess the effects of a small subset of AIDS prevention projects.

6. Some quasi-experiments and nonexperiments may also offer the choice. If the trigger event is an earthquake, then only the retrospective mode is likely to be available, but if it is the initiation of a new legal requirement at a future date, then the choice of a prospective study is available.

7. See Appendix C for a discussion of validity and reliability of behavioral data.

8. Success through design approaches also depends on these things, but careful design allows some of the factors to be controlled. Randomization leads to the most trustworthy expectation that these factors have been controlled, although theory is important to examine whether groups are indeed comparable.

RANDOMIZED EXPERIMENTATION

Randomized controlled experiments specify a single group of interest of sufficient size9 and then randomly assign its members to a treatment group or a control group that receives an alternative treatment or no treatment at all. By randomly assigning units (i.e., individuals, schools, clinics, communities) to treatment and control groups, it becomes possible in theory to interpret any resultant intergroup differences as a direct estimate of the magnitude of the effect, if any, induced by the treatment. The method's assumption that selection bias has been controlled is probabilistically hedged by the significance test. In properly randomized experiments, statistical significance tests can indicate whether the observed differences in group outcomes are larger than can be explained by the random differences in the groups.

By providing a statistically well-grounded basis for assessing the probability that observed differences in outcomes between groups are attributable to the treatment, well-executed randomized experiments reduce ambiguity in the interpretation of findings and provide the greatest opportunity for producing clear-cut results. This reduction in ambiguity is made possible by the fact that assignment to a particular treatment group is by definition independent of all other factors. This inferential strategy requires, however, that the randomization of assignment not be compromised in its execution and that it be maintained through time.10

When assignment is not random or when randomization is compromised, differences between the treatment group and the control group may result from either the effect of the treatment, or from idiosyncratic differences between the groups, or both. If, for example, members of the treatment group differ from those in a comparison group because they were more motivated and thus more aggressively pursued or stuck with the intervention program, the treatment's success may be overstated. On the other hand, if the treatment group represents those at highest risk, any comparison group would have the advantage of being composed of individuals less in need of the intervention. As such examples illustrate, selection bias can cause the intervention group to perform either "better" or "worse" than the comparison group. The direction of the bias, let alone its magnitude, is often difficult to predict beforehand.

9. The question of sample size is important because it affects the statistical variance of the estimate of the treatment effect derived from a given experiment. (Other things being equal, the standard error of this estimate will be the square root of the sum of the squared standard errors of the estimated means of the treatment and control groups.) Sample size is discussed in more technical terms in Appendix D, but, in brief, it should be noted that as the size of the sample increases, the variance in the expected distribution of estimated effects will decrease, thus permitting more precise estimates. (In addition to large sample sizes, homogeneous populations will also reduce variance.)

10. In practice, nonequivalent attrition in the treatment and control groups and other factors can reintroduce the selection biases that randomization excluded. When randomized assignment is thus compromised in execution, the same inferential problems that beset observational studies operate, and they may require use of procedures such as statistical adjustments, modeling of attrition bias, and so forth. In all such instances, it should be clearly recognized that the inferential uncertainties attending a severely compromised randomized experiment may be just as large (or even larger) than those that attend the use of a purely observational design.
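The variance arithmetic in note 9 can be made concrete. The sketch below, an illustration with invented outcome values rather than data from any trial, estimates a difference in mean outcomes between randomized groups, combines the two standard errors as the note describes, and forms an approximate 95 percent confidence interval.

```python
import math

# Hypothetical outcomes: risk-reduction scores in randomized treatment
# and control groups.
treat = [0.62, 0.71, 0.55, 0.80, 0.66, 0.59, 0.73, 0.68]
control = [0.51, 0.49, 0.60, 0.44, 0.58, 0.53, 0.47, 0.55]

def mean_and_se(xs):
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)   # sample variance
    return m, math.sqrt(var / len(xs))                     # SE of the mean

mt, se_t = mean_and_se(treat)
mc, se_c = mean_and_se(control)
diff = mt - mc
se_diff = math.sqrt(se_t**2 + se_c**2)   # note 9: root of summed squared SEs

print(f"estimated effect: {diff:.3f} +/- {1.96 * se_diff:.3f} (approx. 95% CI)")
```

Doubling the size of each group roughly halves the squared standard errors, which is the sense in which larger samples permit more precise estimates.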

The Power of Experiments: An Example

History provides a number of examples of the interpretive difficulties that can attend observational studies (or compromised experiments) and the power of a well-executed randomized experiment to provide definitive results. In the infant blindness epidemic at mid-century, for example, well-executed controlled experiments ended an inferential debate that observational studies had fueled instead of extinguished.

In the 1940s and early 1950s, more than 10,000 infants, most of whom were born prematurely, fell victim to a previously unknown form of blindness called retrolental fibroplasia (Silverman, 1977). Over the years, more than 50 hypotheses were offered for the cause of the disease and for effective treatments. About half the hypotheses were examined observationally, but only four were actually tested in experimentally controlled trials.

Before the experimental studies took place, an uncontrolled study had indicated that the application of ACTH (adrenocorticotrophic hormone) would prevent the fibroplasia. A randomized controlled trial showed that this therapy was unhelpful or worse: a third of the infants who received ACTH became blind, whereas only a fifth of the control group did.

One hypothetical cause of the observed blindness, based on a study of 479 infants, was a deficiency of oxygen. This proposal was countered by another hypothesis, based on 142 observations, that an excess of oxygen was to blame. (During the period of the epidemic, premature infants were routinely given oxygen supplements at a concentration of more than 50 percent for 28 days.) Once again, a well-controlled randomized experiment put the debate to rest: the group of infants randomly assigned to receive the routine supplemental oxygen had a dramatically higher incidence of blindness than the control group (23 percent versus 7 percent).11 Observational studies might have finally yielded this same conclusion (at least one small study had suggested excess oxygen as the culprit), but the human cost and the time involved (10,000 blinded children and more than 10 years) were dear indeed.12

11. Because neither the cause of nor the cure for the children's blindness was known, the randomized trials reported here met ethical standards for varying treatments.

12. The results of the study were widely publicized among ophthalmologists, and within a year the practice of providing high concentrations of oxygen to premature infants was largely modified. Subsequent efforts have been made to provide an oxygen concentration that prevents brain damage but does not cause blindness (Silverman, 1977).

Compromised Randomization

The panel believes that the inferential debates that bedevil the interpretation of nonexperimental studies are largely avoided by well-conducted randomized experiments. In practice, uncertainties may nonetheless attend the inference that a causal relationship exists between the intervention being evaluated and the outcome(s) observed in a randomized experiment. In this section, we discuss four important sources of uncertainty that investigators need to monitor: sample attrition, compliance with treatment, spillover (and diffusion of effects), and compensatory behavior. Note that the first three of these are not solely problems for experiments; they can frustrate observational studies as well. The last, however, is a special risk of randomized experiments.

Attrition

Careful randomization of participants into treatment and control groups is not sufficient in itself to guarantee informative results. Successful experiments also require that sample attrition be minimized. Any such attrition can introduce post-assignment biases in the composition of treatment and control groups. Two types of attrition can occur, each with different results. With one, participants drop out of the study and cannot be followed up. To the extent that this occurs, the integrity of the experiment is compromised and results are subject to some of the same concerns about selection bias that threaten the results of observational studies. If different plausible ways of analyzing the data lead to qualitatively different interpretations, it is then evident that: (1) the evaluator will have to model the self-selection bias, (2) the conclusions may depend on the model approach chosen, and (3) if no strong basis exists for confidence in the chosen model, the study results must be subject to considerable uncertainty.

A second type of attrition occurs when people do not complete the protocol but are still available for follow-up. In this case a valid, interpretable randomized comparison can still be made: outcomes can be compared between all those who started on intervention A and all those who started on intervention B. This comparison is sometimes meaningful because, in practice, the choice may be to start a participant in one intervention or another, in full recognition that some participants may not stick with it.
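The comparison just described, grouping every followed-up participant by the intervention started rather than the intervention completed, is what clinical trialists call an intention-to-treat analysis. A minimal sketch follows; the data structure and field names are invented for illustration.

```python
# Hypothetical follow-up records: assigned arm, whether the protocol was
# completed, and an outcome measured on everyone still reachable.
records = [
    {"arm": "A", "completed": True,  "outcome": 0.8},
    {"arm": "A", "completed": False, "outcome": 0.4},
    {"arm": "B", "completed": True,  "outcome": 0.6},
    {"arm": "B", "completed": True,  "outcome": 0.5},
    # ... one record per randomized participant with follow-up data
]

def mean_outcome(rows):
    return sum(r["outcome"] for r in rows) / len(rows)

# Intention-to-treat: group by assignment, ignoring completion status.
itt = {arm: mean_outcome([r for r in records if r["arm"] == arm])
       for arm in ("A", "B")}

# Completers-only ("per-protocol"): vulnerable to selection bias whenever
# dropping out is related to motivation or prognosis.
per_protocol = {arm: mean_outcome([r for r in records
                                   if r["arm"] == arm and r["completed"]])
                for arm in ("A", "B")}

print("ITT effect (A - B):            ", round(itt["A"] - itt["B"], 3))
print("Completers-only effect (A - B):", round(per_protocol["A"] - per_protocol["B"], 3))
```

The intention-to-treat contrast preserves the randomization; the completers-only contrast does not, for the reasons the next paragraphs describe.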

If defection rates are high, however, restricting analysis to only those who stay with the assigned treatment would produce wholly biased results. For example, selective drop-out may occur from experimental group A because project participation entails more effort than staying in control group B. This type of drop-out introduces selection bias, with the result being that the outcomes of the experimental group will be artifactually overestimated because the more motivated participants remained.13 But some members of group B might also have dropped out had they been assigned to the program that required effort on the part of participants. Where selective attrition occurs, differences in outcome between the two groups are inevitably an unknown mixture of effects related to the actual differences in treatment effects and differences in the kinds of participants who do and do not drop out of the two treatment groups. Even if selective attrition does not occur, the completeness of information may still differ systematically between treatment groups (especially where participant cooperation is necessary to information acquisition); again, bias from self-selection is a risk.

Compliance

Both the first report of the parent committee (Turner, Miller, and Moses, 1989: Chapter 5) and the preceding chapters of this report identify compliance, along with attrition, as major threats to the integrity of experiments. Even in the most carefully designed experiments, a substantial number of individuals may leave the program or fail to comply with the requirements of the experiment and thus not receive the full strength of the intervention. The threat that attrition and noncompliance pose underscores the panel's sense that an essential first step of any outcome evaluation is to analyze the delivery of services before interpreting estimates of project effects. Tracking respondents' compliance with the assigned treatment is essential to ensure that valid inferences can be drawn from the data.

The potential importance of tracking compliance is well illustrated by an example. Clofibrate, a drug intended to treat coronary heart disease, was compared to a placebo in a very large clinical trial, and no significant beneficial effect was found.14 Upon later analysis, however, it was observed that those patients assigned to Clofibrate who actually took at least 80 percent of their medication had a much lower five-year mortality than those in the Clofibrate group who took less than 80 percent of their medication. The mortality rates for the Clofibrate compliers and noncompliers were about .15 and .25, respectively. Note that this was not a randomized comparison; the randomization put all these patients on Clofibrate rather than on placebo. These results appeared to show an important difference and suggested that the drug had beneficial effects.

13. On the other hand, participants may drop out because their transportation falls through or they move away, which, one might expect, would not introduce selection bias. It is, however, the case that ad hoc inferences such as these are always open to challenge. "Transportation falling through" may be a polite way for subjects to disguise their lack of interest in a program.

14. In the trial, 1,103 individuals were randomly assigned to receive the drug, and 2,789 individuals received the placebo.

Compliance (actual drug-taking) was, however, a matter of self-selection. As it turned out, the group assigned to take the placebo also had "good" compliers (who took at least 80 percent of the placebo) and "bad" compliers (who took less than 80 percent). Moreover, their five-year mortality rates were also about .15 and .25 (Coronary Drug Project Research Group, 1980). The effort to use the information available on the patients to account for this self-selection effect failed; the data in the records were not sufficient to adjust away the mortality difference in either group. Without the randomized control group data on compliance, however, a false treatment benefit could easily have been claimed for those who took 80 percent or more of the Clofibrate. While this example does not tell us how alternative methods might have been used to resolve the problem, it does clearly illustrate the importance of tracking self-selection and compliance. It also illustrates the usefulness of data from a randomized control group.
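The logic of the Clofibrate example can be reproduced in a few lines. In the simulation below, a sketch in which the mortality figures echo the approximate rates reported above and the compliance mechanism is an invented stand-in for whatever prognostic trait actually drove it, the pill has no effect at all, yet the complier/noncomplier contrast within each arm shows the same seductive gap.

```python
import numpy as np

rng = np.random.default_rng(2)
n_drug, n_placebo = 1103, 2789    # arm sizes reported for the trial

def simulate(n):
    # An underlying trait (say, general health consciousness) makes a
    # patient both a "good" complier and less likely to die. The pill
    # itself, drug or placebo, does nothing in this simulation.
    healthy = rng.random(n) < 0.6
    complier = rng.random(n) < np.where(healthy, 0.8, 0.4)
    died = rng.random(n) < np.where(healthy, 0.10, 0.33)
    return complier, died

for arm, n in (("clofibrate", n_drug), ("placebo", n_placebo)):
    complier, died = simulate(n)
    print(f"{arm:>10}: compliers {died[complier].mean():.2f}, "
          f"noncompliers {died[~complier].mean():.2f} five-year mortality")
```

Both arms show roughly the .15 versus .25 pattern because compliance proxies for prognosis, not because the pill works; only the randomized placebo arm exposes the artifact.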

Spillover

The diffusion of treatment effects throughout the population can also obscure evaluation results. A major threat to the internal validity of the randomized experiment is "spillover." This phenomenon, the communication of ideas, skills, or even outcomes from the intervention group to the control group, can result in the dilution of estimated program effects in a variety of ways. Members of an experimental group who adopt safer sex skills as a result of an intervention, for example, are likely to come into contact with members of the control group. If both groups are drawn from the same population, the control group may thereby adopt safer sex skills as well, at least when involved with individuals from the experimental group. Alternatively, an effective intervention may produce the outcome of lower infection rates among the experimental group; this outcome would then spill over into reduced rates among the control group because of the reduced pool of HIV-positive individuals to whom they could be exposed. In these situations, it is plausible that any observed difference between the experimental group and the control group is an underestimate of the program's true effect.15

Such spillover effects are less of a threat when the unit of randomized assignment is at the organizational level rather than at the level of the individual. As discussed in Chapter 1, the unit of assignment can be a clinic (i.e., the clientele of the clinic), a community, or a city, and so on. In fact, when thinking about AIDS interventions, it is apparent that many educational projects, such as the media campaign, are based on a diffusion theory that assumes that interpersonal contacts are made after media exposure. In such cases, organizational units are appropriate to study because spillover within units is desired. Nonetheless, spillover across units can remain harmful to the evaluation effort, so geographic proximity of treatment and control groups should be avoided.

Compensatory Behavior

A problem unique to randomized designs is the threat that control group members will act in a way that compensates for their having been assigned to the control group (and that is, in fact, different from the way they would have behaved if truly "untreated"). Such compensatory behavior can contaminate the outcomes of an evaluation, and it is difficult to predict the direction of such bias beforehand. For example, if an attractive AIDS counseling project were offered to some participants but not to others (and both groups were aware of this assignment decision), the nonrecipients could react in different ways. They may overcompensate for their exclusion by taking it upon themselves to change their risky behavior or form their own support group. Such overcompensation would diminish the effects of a project detected by an evaluation. Or, nonrecipients may become demoralized and give up, not making any change in their behavior or even backsliding to riskier ways. Such a reaction by the control group would then tend to overestimate the effects of the intervention on the experimental group. Such effects are particularly worrisome in that they can easily go unnoticed and result in misleading conclusions.

Some protection against such missteps may be afforded by blinding the study so that participants are unaware of the alternate treatments, a strategy that may be feasible when randomization is done at the clinic or community level. Use of ethnographic observers (see Appendix C) may also be helpful in recognizing the presence of such compensatory behaviors. Replication of experiments in different milieus may also protect investigators against such experimental artifacts.

15. It is unlikely that these rates can be adjusted to reflect initial conditions in different communities because we lack reliable data on the prevalence and distribution of HIV in the U.S. population. (See discussion in Chapter 1 of the 1989 report of the parent committee [Turner, Miller, and Moses, 1989].)

Salvaging Compromised Experiments

Extensive data collection can be of some help in salvaging randomized field experiments when the above factors reduce comparability between the treatment and control groups. If the randomized design is compromised, the estimates derived from the data will be subject to bias. For this reason, it can be a wise idea to collect data for randomized experiments as if the experiment were going to fail in one of these ways. Experiments ought to be monitored so that reliable measures of the content of the intervention and relevant characteristics of respondents can be used to advantage in a nonexperimental analysis if systematic attrition does occur. Both attrition and noncompliance will necessarily cause researchers to resort to analyses that use available data and a set of assumptions (a "model") to arrive at conclusions. Note that noncompliance becomes another behavioral variable to be modeled, as in the Clofibrate example given above. Close monitoring of experiments will enable researchers (1) to test whether the remaining members of the experimental and control groups are comparable on pretest measures, or (2) to use the collected data in a nonexperimental evaluation study where assignment or attrition are modeled. As discussed in more detail in a later section of this chapter and in Appendix D, this approach requires detailed data on project implementation, the project environment, and characteristics of participants and nonparticipants, such as demographic and socioeconomic data. Modeling the degradation of the experiment will, however, raise inferential problems because factors such as motivation to comply will be very difficult to measure.
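The first of these monitoring tasks, checking whether the groups that remain after attrition are still comparable at baseline, is straightforward to automate. The sketch below is illustrative only; the variable names, invented pretest values, and the two-standard-error screen are assumptions, not a prescribed procedure.

```python
import math

def balance_check(treat_pretest, control_pretest, label):
    """Compare remaining treatment/control members on a pretest measure."""
    def mean_se(xs):
        m = sum(xs) / len(xs)
        var = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
        return m, math.sqrt(var / len(xs))

    mt, se_t = mean_se(treat_pretest)
    mc, se_c = mean_se(control_pretest)
    z = (mt - mc) / math.sqrt(se_t**2 + se_c**2)
    flag = "CHECK" if abs(z) > 2 else "ok"    # crude two-SE screen
    print(f"{label:<20} diff={mt - mc:+.3f}  z={z:+.2f}  {flag}")

# Hypothetical pretest measures for participants still enrolled at follow-up.
balance_check([26, 31, 24, 29, 35, 27], [30, 33, 28, 36, 32, 31], "age")
balance_check([3.1, 2.8, 3.4, 2.9, 3.0, 3.3],
              [2.6, 2.9, 2.4, 2.8, 2.7, 2.5], "baseline risk score")
```

A flagged imbalance does not by itself identify the attrition mechanism, but it signals that the simple randomized comparison can no longer be taken at face value.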

An alternative to modeling attrition is to try to prevent it in the first place. Given the great investment required for many interventions and their trials, it may sometimes be important to attain high levels of project completion to apply the maximum strength of the intervention. Three approaches can help foster both completion and compliance: a "running-in" period, indoctrination, and outreach.

- To improve compliance, a standard intervention could be offered during a running-in period to detect and reject individuals who drop out within the first few days or weeks. Those individuals who stick with the project can then be randomized into an enhanced intervention or the standard protocol. Alternatively, before a study is fully deployed, would-be participants could be screened in a pre-intervention process in which the enhanced intervention is delivered but not evaluated, to see which individuals will comply with treatment. In either case, only good compliers are allowed into the experiment.16

- Indoctrination involves instructing individuals when they enroll in a project about the expectations for the project. By supporting participants early in the project, when attrition rates are usually highest, investigators can often foster the understanding and trust needed to keep participation and compliance rates high and maintain a well-executed experiment.

- Finally, outreach efforts may be helpful to enhance completion and compliance. By reaching out to individuals who have difficulty completing an intervention, investigators can improve compliance and encourage dropouts to resume participation if attrition occurs. The panel suggests that the research protocol include outreach efforts to dropouts and poor compliers.

Such special efforts to induce completion and compliance should be closely monitored. It should be noted, furthermore, that it may not be possible to maintain such special efforts when the intervention is eventually fielded, so effects estimated under the maximum strength intervention may tend to overestimate what will eventually be obtained.

The panel recommends that studies be conducted on ways to systematically increase participant involvement in projects and to reduce attrition through outreach to project dropouts. All trials should assess levels of participation and variability in the strength of the treatment provided.

In addition to providing insight into how to control attrition, these studies may also be important in determining whether a lack of compliance with treatment contributes to its ineffectiveness. For example, a project or treatment could be constructed that, in itself, is entirely efficacious but is so unattractive or unpalatable to potential participants that compliance is negligible. A case in point is the treatment of alcoholism with disulfiram (Antabuse), a drug that induces weakness and extreme nausea when combined with alcohol. If taken, the drug works as intended; however, few alcoholics can be convinced to comply with the treatment regularly (Fuller et al., 1986).

16. A disadvantage of this option is that some candidates who would have benefited from the study will not be admitted into the experiment.

When Should Randomized Experiments Be Considered?

The panel repeats its strong belief that, if it is appropriate and feasible, a well-executed randomized design is currently the most promising design for conducting an evaluation of the effectiveness of AIDS prevention programs. The questions of appropriateness and feasibility are discussed below.

Is a Randomized Experiment Appropriate?

Whether and when a randomized experiment is appropriate have to do with the research question being asked and whether an intervention project has reached the stage where such questions are timely.

What Is Being Asked? In a process evaluation, when the question is "What is being delivered?," randomized designs are unnecessary because no comparisons need to be made. In a formative evaluation (e.g., testing the persuasiveness of a set of story boards, as described in Chapter 3), experiments can be quite helpful, but other designs are certainly possible. Conversely, when the question is "What works better?," randomized experiments are particularly appropriate. Such cases involve the evaluation of relative effects, such as assessing different versions of a campaign message, educational project, or counseling regimen. The panel has endorsed randomized experiments for such evaluations. Under clearly defined conditions, the panel has also endorsed randomized field experiments to answer the question "Does the project work?" This question requires that the control group not receive an intervention or receive it later than the experimental group. (Note that the latter design cannot be used when it would involve denying or delaying a treatment that is known to be efficacious. Ethical concerns are discussed at greater length in Chapter 5 and Appendix D.)

Timeliness. The panel has previously recommended that process evaluation occur before an outcome evaluation of an intervention is attempted. A site is simply not ripe for outcome evaluation using randomized controlled experiments until the project's implementation is understood and is essentially stable. Moreover, an experimental trial of an intervention is premature until reasonable grounds exist for believing that it may make a difference. It would be a mistake to commit precious resources to a rigorous study undertaken without sufficient theoretical or empirical grounds for hypothesizing the type of effect likely to result from a given intervention. Where theory is insufficient and an empirical base needs to be established, quasi-experimental studies and statistical models will be useful for initially assessing an intervention and providing some data against which to suggest treatment effects. Such preliminary work could be especially important to provide a strong basis for experimentally manipulating large social systems, such as communities. Thus, process evaluations and observational studies of proposed interventions can and frequently should precede a true experiment.

Such preliminary work could be especially important to provide a strong basis for experimentally manipulating large social systems, such as communities. Thus, process evaluations and observational studies of proposed interventions can and frequently should precede a true experiment.

Is It Feasible?

The merits of the well-executed randomized experiment for evaluating project effects are not a subject of much debate; however, the feasibility of such experiments is debatable. A randomized experiment is feasible if it is affordable, if it is acceptable to the community, and if random assignment is logistically possible. The panel addresses each of these issues below.

Affordability. The cost of an evaluation design depends on the scope of the planned research. The scope, in turn, includes such factors as the number of alternate treatments to be studied, the number and geographic dispersion of the sites, the number of study participants, the amount of information to be collected from each participant (and the difficulty involved in collecting those data), and the difficulty of the analysis. Cost involves dollars, personnel, and time, as well as lost opportunities to improve human welfare, and is obviously an important consideration for the sponsor of research.

The panel recognizes that high-quality evaluations can consume substantial resources. Although it may appear that nonexperimental designs can save on costs, we believe this is not necessarily the case. The level of rigor, not the approach chosen, is the foremost determinant of the cost of evaluation research. In addition, nonexperimental studies often incur costs beyond those of experimental studies. The cost of inconclusive nonexperimental studies during the infant blindness epidemic is illustrative. Because considerable uncertainties attend the interpretation of the results of nonrandomized experiments, firm inferences of causality may require additional labor on the part of investigators to rule out competing explanations, and widespread acceptance of the study's conclusions may be difficult to obtain.

Even when an intervention's desired outcome is distant in time, the properly executed randomized experiment can save time and can provide widely accepted answers to controversial questions. A case in point is the treatment of breast cancer, for which survival rates are measured at some time distant from the intervention. In the late nineteenth century, Halstead introduced a form of surgery designed specifically to reduce the local recurrence of the disease by removing regional lymph nodes in addition to the cancerous breast. Because of its dramatic success in reducing the local recurrence of tumor, the surgery was widely adopted and extended to the treatment of axillary lymph nodes.

This "radical" mastectomy prevailed as the standard treatment of breast cancer for nearly 70 years, despite the fact that it was a disfiguring operation, required a lengthy recovery period, and, in some cases, resulted in prolonged disability. In 1971, a randomized trial involving close to 1,700 patients was initiated to compare radical mastectomy with a less extensive form of surgery called total mastectomy. For 20 years prior, scattered anecdotal information had been reported about the value of less extensive surgery, but the studies were sufficiently flawed as to only increase the controversy about the effectiveness of radical mastectomy in improving survival rates. The randomized trial demonstrated conclusively that radical mastectomy offered no survival advantage. This result has changed dramatically the way patients are now treated and also has changed the scientific understanding of the nature of the disease (Fisher et al., 1985).

Resources for Evaluation. Because the practical benefits of evaluation may accrue in the long term, the near-term perspective on evaluation, especially during a health emergency such as AIDS, may be shortsighted. A commonly held viewpoint is that current allocations for evaluation consume money that could be used to run additional AIDS prevention projects, a perception that can foster resentment of evaluation efforts among practitioners, sponsors, and recipients. To avoid this kind of resentment and the pressure to forgo evaluation in the near term so that projects can be deployed, we believe that the responsibility for funding evaluation should be separated from the responsibility for running the programs.17

The panel recommends that the Office of the Assistant Secretary for Health allocate sufficient funds in the budget of each major AIDS prevention program, including future wide-scale programs, to implement the evaluation strategies recommended herein.

Finally, the panel repeats its advice that, to limit the resources needed for evaluation, evaluation efforts be concentrated on projects that are believed to be important and that are technically appropriate, i.e., representative of an intervention type, replicable in theory, and locally feasible to implement. (These criteria are discussed in Chapter 4.)

17. Because the resources for fighting the epidemic are certainly limited, project costs are also important. It may be tempting to deliver inexpensive alternatives rather than their more expensive versions but, ultimately, an evaluation that demonstrates which project has the highest dividends ought to lead to a more economical allocation of funds.

Acceptability. Community acceptance of evaluation varies. The panel recognizes that evaluation research of any kind may be suspect in some communities. In such situations, communities may single out randomized experimental designs as particularly unattractive, for several reasons. Because of the diversity of community viewpoints, a number of approaches may have to be tried to make randomization more palatable.

One objection to experimentation may involve perceptions that investigators are unmindful of the needs and constraints of project administrators and, as a result, of the affected community. A way to improve an experiment's quality and at the same time increase its acceptability is to involve project administrators and practitioners in its design. Enlisting practitioners' insights into a project's goals and operations ought to lead to better research by ensuring that its design is implemented as planned, by understanding if, and how, a design needs to be altered, and by communicating goals to project participants. Such a strategy should go a long way toward: (1) improving the research, (2) building up pools of evaluation expertise, (3) engendering support and derailing misunderstandings that could otherwise lead to nonrandomized assignments, and (4) allaying perceptions of a project's or community's "guinea pig" status.

Another major objection to the use of controlled trials is the public perception that eligible populations are being denied services. Yet, when resources are scarce and need is widespread, assignments made on a random basis (e.g., through a "lottery") should be less objectionable than assignments made on almost any other systematic basis. In addition, the need for an intervention can be taken into account. Randomization does not necessarily mean that every respondent has to have an equal probability of assignment to treatment; rather, stratified random assignment or other probability-based allocation can be used.

Stratification involves dividing a population into mutually exclusive groups, or strata, and randomly selecting sample units from among them. This procedure ensures that certain strata theoretically related to an outcome are included in a sample in sufficient numbers to analyze them statistically. For example, a community intervention to educate gay men in the negotiation of safer sex may stratify its clients according to whether or not they are in stable relationships, if theory predicts that pre-existing relationships are important. Half of the clients in each stratum might then be assigned at random to group A or group B; a simple sketch of this procedure follows.18 When outcome data are analyzed, variance attributable to stratum membership can then be controlled and reduced because the organization of the population into homogeneous subsets reduces sampling error.

18. Note that treatment and control groups need not be of equal size.
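To make the stratum-by-stratum split just described concrete, the following minimal Python sketch randomly divides each stratum's clients between groups A and B. The client identifiers, stratum labels, and seed are invented for illustration; a real trial would draw them from its enrollment records.

```python
# A minimal sketch of stratified random assignment; all names and
# values here are hypothetical.
import random

def stratified_assign(clients, strata, seed=1991):
    """Randomly split each stratum's clients between groups A and B."""
    rng = random.Random(seed)   # a fixed seed makes the split auditable
    assignment = {}
    for stratum in sorted(set(strata)):
        members = [c for c, s in zip(clients, strata) if s == stratum]
        rng.shuffle(members)
        half = len(members) // 2
        for client in members[:half]:
            assignment[client] = "A"
        for client in members[half:]:
            assignment[client] = "B"
    return assignment

clients = ["c01", "c02", "c03", "c04", "c05", "c06", "c07", "c08"]
strata = ["stable", "stable", "stable", "stable",
          "single", "single", "single", "single"]
print(stratified_assign(clients, strata))
```

Because each stratum is shuffled separately, the design guarantees balanced representation of each stratum in both groups rather than leaving that balance to chance.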

At the same time, stratification can also be used to favor groups with greater needs in a way that tips the probability of treatment assignment toward them, thus making randomization more attractive. Stratification can delineate groups with greater needs, e.g., those who inject cocaine instead of or in addition to opiates,19 and attach a probability of assignment greater than .50 that they will receive a treatment or an enhanced intervention. For example, a stratum of cocaine injectors might be favored at .75, so that three quarters of its members receive the enhanced intervention, whereas the odds of assignment could be reversed (i.e., .25) for the stratum that injects opiates only.20

Providing alternative or delayed treatment is another way of increasing the acceptability of randomization, as the panel has pointed out in previous chapters. A recent example of alternative treatments is provided by Valdiserri and colleagues' (1989) evaluation of the effects of two peer-led interventions to reduce risky sexual behaviors. A sample of gay men in Pittsburgh was randomly assigned to receive either a lecture on safer sex or a lecture and a skills-training intervention in which safer sexual encounters could be rehearsed. Men who received skills training had a higher rate of condom use at follow-up than men who received information only in the lecture.

An evaluation design such as Valdiserri's answers the question "What works better?" but not "Does it work?" To increase the acceptability of randomization to answer the latter question, the panel has suggested delayed project implementation. A wait-list condition, which only temporarily withholds treatment or services from control group members, may be more palatable to some project administrators or their target audiences than no treatment at all. For example, Coates and colleagues (1989) used a delayed treatment condition in their evaluation of a project designed to reduce risky sexual behavior. Investigators recruited 64 seropositive gay men meeting study criteria and randomized half to receive stress management training and the other half to a waiting list. They found that the experimental group reported a mean of 0.50 sexual partners at post-treatment, compared with 2.29 partners for the control group (baseline means were 1.41 and 1.09 partners, respectively).

19. A New York study indicates that cocaine is injected more frequently than opiate drugs because of its relatively short-lived effect; in addition, it is thought that needle hygiene decreases over the course of an injection session (Friedman et al., 1989).

20. When program effects are estimated from strata with different probabilities of assignment, the estimates will require weighting by the inverse of the selection probability if one wishes to generalize the estimate of effect back to the population that was randomized; a small numerical sketch of this weighting follows.
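The weighting described in footnote 20 can be sketched numerically. In the hypothetical Python fragment below, the strata, assignment probabilities, and outcomes are all invented; the point is only that each person's outcome is weighted by the inverse of the probability of receiving the condition he or she actually received.

```python
# Hypothetical sketch of inverse-probability weighting when strata are
# randomized with unequal assignment probabilities (see footnote 20).
import numpy as np

p_enhanced = {"cocaine": 0.75, "opiates_only": 0.25}  # P(enhanced) by stratum

stratum = np.array(["cocaine", "cocaine", "opiates_only", "opiates_only"])
enhanced = np.array([1, 0, 1, 0])                     # condition received
outcome = np.array([0.80, 0.55, 0.70, 0.60])          # e.g., proportion adopting safer practices

p = np.array([p_enhanced[s] for s in stratum])
weight = np.where(enhanced == 1, 1.0 / p, 1.0 / (1.0 - p))  # inverse selection probability

mean_enhanced = np.average(outcome[enhanced == 1], weights=weight[enhanced == 1])
mean_standard = np.average(outcome[enhanced == 0], weights=weight[enhanced == 0])
print("weighted effect estimate:", round(mean_enhanced - mean_standard, 3))
```

Without the weights, the favored stratum would be overrepresented among the treated and the estimate would not generalize back to the population that was randomized.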

Some people have suggested that withholding or delaying treatment is unprincipled. The objection to withholding services assumes, of course, that the services are known to be effective. The panel has already stated its position that withholding effective services is unethical; testing whether services are effective, however, is ethical. (The ethics of randomized experiments are discussed in Chapters 4 and 5 and Appendix D; the related question of confidentiality is also discussed in Appendix D.)

A final suggestion to increase the acceptability of a randomized study involves offering new or improved services to participants as integral parts of research and demonstration projects. By making grants or contracts available to fund such services with a rigorous, well-designed controlled trial, the demand for increased services can be linked to the need to gather convincing evidence of their effects. Successful examples of this linked approach are projects funded by the National Institute on Drug Abuse and the National Institute on Alcohol Abuse and Alcoholism: e.g., a psychotherapy trial for cocaine abuse, two research and demonstration projects for drug treatment improvement, and collaborative trials of patient-treatment matching.21 These initiatives provided substantial funding for services that are part of a research protocol. The panel believes that by coordinating such grants to require replicable service protocols and standard instrumentation, these collaborative studies will provide opportunities to systematically investigate more services and combinations of services than would be possible with typical single-site, single-investigator models.

The panel recommends that new or improved AIDS prevention services be implemented as part of coordinated collaborative research and demonstration grants requiring controlled randomized trials of the effects of the services.

Such randomized trials would be affordable and would address the right kinds of questions, i.e., "Does it work?" and "What works better?" The panel believes that broad and insurmountable ethical barriers to randomized experiments do not arise except when the use of no-treatment controls is considered; we note, however, that any particular study may raise idiosyncratic ethical questions that must be resolved before this recommendation can be implemented.

21. The National Institute on Drug Abuse details the cocaine psychotherapy trial in announcement DA-90-01 and the drug treatment research and demonstration projects in DA-89-01 and DA-90-05. The National Institute on Alcohol Abuse and Alcoholism details the patient-treatment matching trials in announcement AA-892A.

Logistics of Randomized Assignment. The logistics of randomization entail the careful assignment of study participants to subsets that receive a treatment or its alternatives in such a way that every participant has a known and nonzero chance of being assigned to a given subset. For example, if two equally sized subsets are going to be constructed, participants can be assigned by tossing a fair coin: if the toss is heads, he or she is assigned to group A; if it is tails, he or she is assigned to group B. If there are more than two subsets, a die might be tossed, cards may be shuffled, or spinners may be turned. Alternatively, a random number list or computerized random number generator can be used to assign participants to different subsets, depending on whether odd or even numbers turn up; a brief sketch of such a mechanism appears below.

A possible pitfall of the randomized controlled trial is faulty implementation. This trap can open up through misassignment of project participants at the beginning of a study, which can occur in a number of ways, both inadvertent and intentional. Sometimes the randomization mechanism is simply faulty or its use misunderstood.22

Perhaps a greater threat is intentional misassignment. Efforts are needed to forestall the opportunity for anyone involved with the randomization (investigator, administrator, participant) to change the assignment in any way (or to influence it in advance). These opportunities can occur when the interviewer and/or the participants are aware of the details of the experiment. In medical trials, the experiments should be "blinded," i.e., participants should not be aware of their condition in the experiment, and the health care provider should be unaware either of the treatment condition or of who is receiving the treatment.23 Blinding is not always possible in social experimentation, but attempts should be made to avert the possibility of clients' knowing and changing participant assignment or behaving in a "socially desirable" way. Take, for example, the train-the-trainers interventions that are frequently offered by community-based organizations (CBOs). In such cases, a desired outcome could be the subsequent knowledge or behaviors of the trainers' clients; to avoid the appearance of specialness, the clients should not know whether their trainer received the intervention or not.

22. The device used to randomize assignments should be tested, understood, and correctly used. For example, researchers must comprehend how to use published random number lists, or, if a computerized random number generator is used, a new seed must be selected every time a new list of numbers is created.

23. Blinding can also be successful at reducing the effects of knowing participation status and the ensuing "Hawthorne effect" (see, e.g., Maxwell and Delaney, 1990). As mentioned in Chapter 1, this is the psychological effect of knowing that one is participating in an experiment, which may cause people to behave differently than they would in a natural setting.
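As an illustration of the computerized mechanism mentioned above, the following hypothetical Python sketch assigns enrollees to one of three arms with known, equal probabilities and keeps an append-only log of each assignment as one simple guard against later reassignment. The seed, arm labels, and participant identifiers are all invented, and, per footnote 22, a new seed would be chosen for each fresh assignment list.

```python
# Sketch of a computerized analogue of the coin toss or die described
# above; all identifiers and the seed are hypothetical.
import random

ARMS = ["A", "B", "C"]                 # three study conditions

def make_assigner(seed):
    rng = random.Random(seed)          # per footnote 22: new seed per list
    log = []                           # append-only audit trail of assignments
    def assign(participant_id):
        arm = rng.choice(ARMS)         # known, equal 1/3 chance of each arm
        log.append((participant_id, arm))
        return arm
    return assign, log

assign, log = make_assigner(seed=53118)
for pid in ["p001", "p002", "p003", "p004", "p005", "p006"]:
    print(pid, "->", assign(pid))
```

Keeping the assignment function and its log separate from the staff who recruit and interview participants is one concrete way to support the blinding concerns discussed above.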

Another logistical problem has to do with the eligibility of participants. Obviously, the feasibility of a randomized experiment is greatly diminished if a candidate population has already been exposed to the program in question or to a similar program, thus thwarting the prospects of a control group. For example, if a community-wide AIDS intervention project has already "saturated" the community, the random assignment of individuals or communities to experimental and control conditions is virtually precluded. (If records are good enough, however, another evaluation design may be possible, such as the interrupted time series analysis, to be discussed in the next section on quasi-experiments.)

It is because randomized experiments are not always feasible or appropriate that we look to alternative methodologies. Despite some inferential ambiguity in their results, the panel believes it is often preferable to pursue a nonrandomized study, if carefully conceived and implemented, than to forgo an evaluation because randomization is not possible. As discussed earlier, we divide the alternative approaches to outcome evaluation into two camps: those that design comparison groups on an a priori, nonrandomized basis and those that develop comparability through post hoc statistical adjustments of the data. We turn now to the group of strategies that design comparison groups on a nonrandomized basis.

DESIGNING COMPARABILITY INTO NONRANDOMIZED STUDIES

One general nonexperimental approach to controlling selection bias is to build group comparability into a research design before collecting and analyzing the data. In such planned comparisons, the panel believes it may be possible to control some of the confounding factors that could otherwise produce spurious outcome differences between groups. The panel cautions, however, that the usefulness of such designs in evaluating AIDS projects may be hobbled by inadequate information about the relevant determinants of outcomes and the factors that affect selection into projects.

In this section, the panel addresses the use of quasi-experiments, natural experiments, and matching for the outcome evaluation of AIDS prevention programs. Potential sources of data for nonrandomized designs will also be discussed in this section.

Quasi-Experiments

When randomized assignment to treatment is not appropriate or feasible, it may be possible to use quasi-experimental designs to estimate the direction and approximate magnitude of project effects.24

These studies attempt to build comparability into comparisons through a deliberate design that may permit the inference of a treatment effect, along with estimates of its size, given some assumptions that may not be too difficult to justify (see Campbell and Stanley, 1966, and Cook and Campbell, 1979). In the following section the panel discusses the conceptual foundations, assumptions, data needs, and possible inferences of time series and regression displacement/discontinuity designs. (Some of the time series and regression examples used here involve the analysis of "natural" events; natural experiments will be further discussed in a later section.)

Interrupted Time Series

In the interrupted time series design, a number of observations are made over time on an outcome measure of interest for a well-defined group. During the course of the measurements, the intervention (i.e., "the interruption") is introduced, and the schedule of observations continues. The resulting time series can then be examined for shifts in the outcome measure, the crucial question being whether an effect appears simultaneously with or soon after the treatment or interruption.25 In this case, the group that receives the intervention acts as its own comparison group: the group's records before the intervention are compared with its records afterward. (A minimal numerical sketch of this pre/post comparison appears below.)

In some cases, the timing of the interruption is controlled by the investigator, who can then design the quasi-experiment around the manipulated interruption. This option may apply in the evaluation of some AIDS prevention projects. In other cases, the interruption occurs naturally; that is, a situation arises in which some individuals will receive an intervention and others will not, for reasons apparently unrelated to the outcome variable. Although investigators cannot control such "natural" interruptions, they can sometimes anticipate them (as in the case of pending legislation) and begin to collect relevant data before the event occurs. Other times the natural interruptions are unforeseen; if records are good enough in these cases, time series analyses may still be feasible.

24. In Chapter 4, the panel noted that a quasi-experiment might be useful to test the null hypothesis that no change occurs when it can be confidently assumed that uncontrolled factors will have a positive effect on outcome, if they have any effect at all. We should caution that a quasi-experiment that does not reject the null hypothesis has not in fact settled the issue of whether an intervention is effective. If good theoretical grounds exist for believing that the intervention should make a difference, a randomized experiment may still be needed to test effects.

25. A good test for identifying a distinct change in the series of measurements is given in Box and Tiao (1965).
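In its simplest form, the pre/post comparison at the heart of this design can be sketched as follows. The monthly figures below are fabricated, and the two-sample test shown is only a crude stand-in for a proper time series model, which would also account for trend and autocorrelated error (Box and Tiao, 1965).

```python
# A minimal, hypothetical interrupted time series check: does the
# outcome level shift at the known interruption point?
import numpy as np
from scipy import stats

outcome = np.array([48, 51, 50, 49, 52, 50,    # six pre-intervention months
                    41, 40, 43, 39, 42, 40])   # six post-intervention months
cut = 6                                        # month the interruption occurs

shift = outcome[cut:].mean() - outcome[:cut].mean()
t, p = stats.ttest_ind(outcome[cut:], outcome[:cut])
print(f"estimated level shift = {shift:.1f} (t = {t:.2f}, p = {p:.3f})")
```

With only a dozen observations and no modeling of history or trend, a result like this is suggestive at best; the examples that follow show why longer series and comparison sites matter.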

A good way to describe the interrupted time series design is by illustration. One example is provided by the national 55 mph speed limit adopted by Congress in 1974. Shortly after this legislation was enacted, investigators in Texas estimated its effects using a time series model of state highway fatalities. Examining monthly records from 1972-1974, investigators found a 57 percent reduction in Texas fatalities attributable to the new law (Transportation Research Board, 1984). This analysis could be made only because suitable records were available from previous periods, although it did not take into account competing hypotheses for the change.

A related design is the multiple time series approach. This method uses one or more comparison groups or areas where the intervention is not introduced. Outcome measurements are made in the treated area concurrently with outcome measurements in the comparison areas. This method works better than the single-site approach at ruling out competing hypotheses for any observed changes. Consider, for example, a multiple time series that was applied in the 1970s after a methadone maintenance clinic was closed in Bakersfield, California. Investigators took advantage of this involuntary "intervention" by following up the clients of the defunct treatment center for two years and comparing them with clients of a clinic that continued to function. Both groups were measured for readdiction, arrest, and incarceration. Investigators found that clients of the closed clinic fared more poorly than clients of the open clinic (McGlothlin and Anglin, 1981).

A more recent study used both single-site and multiple time series analysis to test the effects of anonymity on the number of gay men seeking HIV testing. Prior to December 1986, all public HIV testing in Oregon was done confidentially. At that point, clients of public testing centers were offered a "new" intervention: anonymous testing. In the first four months following the intervention, demand for HIV testing on the part of gay men increased by 125 percent. First, by comparing the number of test takers on a pre/post basis, a time series was constructed that used the history provided by pre-intervention observations as a way to rule out other explanations for observed changes. Second, by comparing the number of individuals seeking HIV testing across several sites, investigators were able to test the hypothesis that demand among gay men would have remained constant had anonymous testing not been available. Natural comparison groups that did not receive the new intervention were users of anonymous private test sites in Oregon, anonymous public sites in California, and confidential public sites in Colorado (Fehrs et al., 1988).
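The logic of the multiple time series design can be sketched by contrasting a treated site's pre/post change with a comparison site's change over the same months (what later literature calls a difference-in-differences contrast). The series below are fabricated and only loosely echo the Oregon testing example.

```python
# Hypothetical multiple time series contrast: a treated site versus an
# untreated comparison site, four observations before and four after.
import numpy as np

treated    = np.array([120, 118, 125, 122,  265, 270, 262, 268])  # e.g., tests sought
comparison = np.array([115, 119, 117, 120,  121, 118, 123, 119])
pre, post = slice(0, 4), slice(4, 8)

change_treated    = treated[post].mean() - treated[pre].mean()
change_comparison = comparison[post].mean() - comparison[pre].mean()

# If history alone drove the change, both sites should shift similarly;
# the difference between the two changes estimates the net effect.
print("net effect estimate:", change_treated - change_comparison)
```

The comparison series plays exactly the role described above: it absorbs community-wide influences that would otherwise masquerade as a treatment effect.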

Assumptions. When no comparison groups are used, the history provided by the before-treatment observations becomes a control mechanism, allowing estimates of the outcome that might have been measured without the treatment. The validity of inferences from the interrupted time series design rests on the assumption that no competing hypotheses plausibly explain the shift in the time series that occurs after the intervention (competing explanations include historical events, maturation of individuals, changes in record keeping, the sensitizing effects of testing on participants, and so on).26 This assumption of a good control is vulnerable when compared to well-executed randomized experiments.

Alternative hypotheses, for example, had to be considered in evaluating the effects of the national 55 mph speed limit law. Using a time series method to analyze the effect of the legislation nationwide, investigators examined national statistics of highway fatalities between 1970 and 1979. Their initial analysis indicated a reduction of 10,400 fatalities per year, an estimate that was later substantially revised downward. The magnitude of effects was significantly reduced when the following competing explanations for change were observed: (1) a concurrent historical decline in fatalities, which was discovered when a longer time series was used; (2) a reduction in discretionary travel coinciding with the Arab oil embargo and ensuing fuel shortages; and (3) improvements in highway and auto safety. Taking these other factors into account, the speed limit law appears to have saved about 3,200 to 4,500 lives per year rather than the 10,400 lives initially estimated (Transportation Research Board, 1984).

Data Needs. Compared to randomized experiments, interrupted time series designs typically require more data to infer causation. Particularly in the context of AIDS prevention, a wide range of data may be needed to test for the effects of changes in project recruitment, of other AIDS programs, and of contemporaneous events. Information about these types of changes occurring during the study is essential because such changes may provide alternative explanations for a shift in a time series. Information about them also helps to model the behavior of the time series and may serve to reduce the size of the error term in the statistical model.

In the presence of such confounding factors, it may not be possible, however, to disentangle the effects of a particular intervention in a quasi-experiment from other events that occur at the same time. To circumvent this problem, the panel suggests, whenever feasible, collecting data on multiple comparison groups, multiple indicators of an outcome, multiple interruptions, and multiple measurements over time.

Multiple comparison groups may be used to rule out history and various forms of selection bias as plausible explanations for changes in the outcome variable.

26. With the multiple time series design, the use of multiple comparison groups helps to verify that history is not a plausible explanation for change.

Multiple indicators, or "nonequivalent dependent variables" as they are sometimes known, refer to variables that are expected to be changed by a particular intervention as well as variables that theory predicts should not be changed by the intervention. For example, a neighborhood project that distributes bleach and teaches safer injection practices can anticipate (1) changes in methods of cleaning drug paraphernalia, but not (2) changes in the frequency of drug use. If the investigator measures both indicators, however, and the time series indicates changes in (1) and not in (2), it provides a clearer inference that the project, and not some other intervention or historical trend, was leading to change. This strategy of measuring multiple indicators will not always be possible, but it is worth trying when feasible. (A small sketch of this two-indicator check appears at the end of this discussion.)

Analysis of multiple interruptions can provide additional certainty in inference; although multiple interruptions are not typical, they should be exploited whenever possible. For example, Hennigan and colleagues (1982) were able to devise a time series with two separate interruptions to add plausibility to their surprising finding of an effect of television on crime. Using an interrupted time series design with "switching replications,"27 investigators discovered increased larceny rates following two separate local introductions of television in 1951 and 1955. Because of a freeze on new broadcasting licenses ordered by the Federal Communications Commission, television was introduced in this country on a staggered basis: some communities gained access to television before the freeze was initiated, and others had to wait until the freeze was lifted. Although investigators found no increase in violent crimes, burglary, or auto theft, they did find consistent increases in larceny with the introduction of television, both in prefreeze locations relative to postfreeze locales and again in the postfreeze locales relative to the prefreeze areas. This increase was observed both for the communities that received television in 1951 and again for the communities that received television in 1955.28

27. A switching replication design usually involves two groups acting as comparison groups for one another: two interventions are provided to the two groups at the same time, and the interventions are switched after a prescribed period of time. (The design used by Hennigan and colleagues can also be said to involve switching replications.)

28. Content analysis of early television indicated a preponderant depiction of upper- and middle-class lifestyles, the preponderant advertisement of consumption goods, and a subordinate portrayal of larceny relative to violence in crime shows. The authors tentatively attributed the increase in larceny to factors theoretically associated with viewing high levels of consumption, i.e., explanatory theories of relative deprivation and frustration, rather than to factors associated with the social learning theory of larceny.
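Returning to the bleach-distribution example above, the two-indicator check can be made mechanical. The monthly series below are fabricated; the pattern to look for is a shift in the indicator the intervention targets but not in the indicator it should leave alone.

```python
# Hypothetical two-indicator check ("nonequivalent dependent variables"):
# the project should shift indicator (1) but not indicator (2).
import numpy as np
from scipy import stats

cleaning  = np.array([12, 14, 13, 12,  31, 33, 35, 34])  # % cleaning paraphernalia (should shift)
frequency = np.array([22, 21, 23, 22,  21, 23, 22, 21])  # injections per week (should not)
cut = 4                                                  # month the bleach project begins

for label, series in [("cleaning practices", cleaning),
                      ("frequency of drug use", frequency)]:
    shift = series[cut:].mean() - series[:cut].mean()
    t, p = stats.ttest_ind(series[cut:], series[:cut])
    print(f"{label}: shift = {shift:.1f} (p = {p:.3f})")

# A shift in the first series but not the second points to the project,
# rather than a community-wide trend, as the source of the change.
```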

Multiple measurements over time may also increase our confidence in the results of a quasi-experiment. Emmett (1966) notes that in evaluating radio and television messages, repeated measurements before and after a broadcast are a good way to increase confidence that changes are attributable to the broadcast and are not short-lived. Data collection in such designs may begin many months before the intervention starts and continue for several months or even years after the project's conclusion in order to obtain a series of baseline and post-interruption measurements. It should also be added that the time series design requires stable instrumentation to avoid confusing changes in measurement procedures with changes in outcomes.

Inferences. In interrupted time series analysis, the comparison group is the same community as it was in the recent past, and any systematic changes in the outcome variable are modeled as a time series (possibly a nonstationary series, i.e., one that exhibits a secular trend). Potential problems that may arise in such time series analysis include autocorrelated error,29 the effect of repeated measurements on the sample, too few data points in the time series, and the fitting of overly complicated models to the data.

Shortness of either time series (before or after the intervention) lessens the strength of the inference, and modeling becomes more difficult. For example, Boruch and Riecken (1975) found that an evaluation of nutrition and education programs in Cali, Colombia, produced "drastically" biased estimates of program effect because it was based on an overly short time series. In this case (see McKay, McKay, and Sinisterra, 1973), the time series estimate of the program's effect on children's cognition was half the size of the effect estimated in a randomized test. The reviewing authors posited that the bias would have been smaller had a longer time series been available.

To increase one's ability to model the process, multiple time series analysis is preferred. By using another area that does not receive the intervention as an additional comparison group, a (partial) control on the effects of history is added. In the context of AIDS research, the panel believes that multiple time series analysis may provide a useful method for evaluating the effects of community projects or the media campaign when randomized experiments are not feasible.

29. In a regression equation, the error term represents the difference between the real value of the outcome and its predicted value. Autocorrelated errors occur when the error term at time 1 is correlated with the error term at time 2, and so on.

Regression Discontinuity or Regression Displacement

These quasi-experimental designs are similar in concept to the interrupted and multiple time series designs discussed above, but they do not require the same assumptions about selection factors because the basis for selection is deliberately designed into treatment eligibility. In both regression designs, a group is assigned to an intervention on the basis of a decision variable that may or may not be related to the outcome variable, and the other group or groups are assigned to be the comparison.

Campbell (1990) provides a helpful example of the regression displacement design.30 As shown in Figure 6-1, he looked at the effects of a "natural" intervention, Medicaid, on the number of times individuals visit physicians, comparing the number of doctor's visits made by six groups of individuals with varying levels of income (data reported by Wilder [1972] and Lohr [1972]). Individuals earning $3,000 or less were entitled to Medicaid; the others were not (thus delineating the decision criterion). When compared with visits made by the other groups (for whom the number of doctor's visits increased as income increased), the number of doctor's visits by the poorest group was "displaced" after the intervention was introduced.

In regression displacement analysis, a single regression line is fit for the comparison groups, excluding the experimental groups or areas from the analysis. The regression line is fit after the intervention by regressing posttest scores on pretests.31 A test is then made to see whether the experimental group belongs along the regression line or is significantly "displaced" from it. In regression discontinuity analysis, separate regression lines are fit for the groups' outcomes; if the intervention has an effect, a "discontinuity" between the lines should appear at the decision point. (See Mood [1950] for a t test of the significance of a regression displacement point and Cook and Campbell [1979] for a significance test for regression discontinuity.)

It should be clear that many regression displacement and regression discontinuity designs completely confound their treatment (e.g., Medicaid) with a particular level of the sorting variable used to determine eligibility for the treatment (e.g., incomes below $3,000). In theory it is possible that post-intervention evidence of "displacement" or "discontinuity" could reflect the influence of the sorting variable and not the intervention.

30. Regression displacement is a new name coined by Campbell (1990) for an old but largely neglected design. Cook and Campbell (1979:143-146) called the method the "Quantified Multiple Control Group, Posttest Only Design," and Riecken and Boruch (1974) called it the "Posttest-Only Comparison Group Design." Fleiss and Tanur (1973), Ehrenberg (1968), and Smith (1957) were also among the methodological progenitors of the design.

31. The regression of posttest on pretest measures is not strictly necessary. Any variable can be used to set a decision point between comparison groups as long as it is theoretically related to the outcome variable of interest.

[Figure 6-1 appears here as a plot; only its labels are reproduced. Horizontal axis: number of physician visits per year, July 1963-June 1964. Vertical axis (label garbled in the source): visits on a scale of approximately 4.0 to 4.8. Legend: A = $15,000 or more; B = $10,000-14,999; C = $7,000-9,999; D = $5,000-6,999; E = $3,000-4,999; F = under $3,000.]

FIGURE 6-1 Regression displacement analysis of the effects of the introduction of Medicaid in 1964. SOURCE: Data points by Wilder (1972); displacement analysis by Campbell (1990).
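The displacement test just described can be sketched as follows. A line is fit by regressing posttest on pretest for the comparison groups alone, and the treated group's distance from that line is the quantity of interest. The pre/post values below are invented stand-ins that only loosely echo the Medicaid illustration.

```python
# Hypothetical regression displacement check: regress post-intervention
# scores on pre-intervention scores for the comparison groups, then ask
# how far the treated group falls from that line (all values invented).
import numpy as np
from scipy import stats

pre  = np.array([4.15, 4.30, 4.40, 4.55, 4.70])  # comparison groups, pretest
post = np.array([4.18, 4.32, 4.43, 4.57, 4.73])  # comparison groups, posttest
line = stats.linregress(pre, post)               # comparison-only line

treated_pre, treated_post = 4.05, 4.60           # the group made eligible
predicted = line.intercept + line.slope * treated_pre
print("displacement of treated group:", round(treated_post - predicted, 2))

# Mood (1950) gives a t test that also reflects the uncertainty of the
# fitted line at the treated group's pretest value.
```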

Evidence to rule out such counter-hypotheses can often be adduced from pre-intervention studies (see, for example, Lohr, 1972). The inference that the treatment itself is causing the displacement or discontinuity will be clearest when the same function fits both the pre-intervention data and the untreated portion of the post-treatment data.

Although regression discontinuity and displacement designs have not been frequently used to evaluate large-scale social programs, some attempts have been published. For example, Berk and Rauma (1983) employed a regression discontinuity design to evaluate the effects that paying unemployment benefits had on recidivism of ex-offenders. The decision point for receiving benefits was based on the number of hours an ex-offender had worked in prison. Using this method, the authors found that the group receiving the intervention had a significantly lower recidivism rate. To the best of the panel's knowledge, no evaluations using these designs have been made of AIDS prevention programs, so their value has not been established in this area. (A brief numerical sketch of the discontinuity computation follows the Data Needs discussion below.)

Assumptions. Regression discontinuity and regression displacement designs assume that the intervention group and comparison group or groups come from the same population so that, except for their intervention eligibility, idiosyncratic differences between individuals are distributed more or less equally among the population. The assumption that selection is thereby controlled is not verifiable, and the panel warns that other factors may vary between the groups, and they may be important determinants of the outcome. Both regression displacement and regression discontinuity designs also rely on the usual assumptions underlying typical regression models, including homogeneity of error variance, stability of the regression coefficients over the time period, and the adequacy of the functional specification. These assumptions cannot, in many circumstances, be verified.

Data Needs. In regression discontinuity and regression displacement designs, additional data are required to make a causal inference plausible. Moreover, it is important to have reliable measures of the decision variable that is used to determine a cut-off or demarcation point between treatment and comparison groups. Measurement is crucial because an individual's score on the decision variable decides his or her participation in the intervention. Unfortunately, scoring is fallible, which means that the basis for an individual's selection into a project can be mistaken. For example, in the Medicaid study, scoring could be invalidated if persons underreported their incomes in order to participate in the program.
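The following hypothetical sketch shows the discontinuity computation in miniature: separate lines are fit on either side of the eligibility cutoff, and the gap between them at the cutoff estimates the effect. The decision variable and outcomes loosely mimic the Berk and Rauma (1983) setting but are entirely invented.

```python
# Hypothetical regression discontinuity sketch: two lines, one gap.
import numpy as np
from scipy import stats

hours = np.array([100, 200, 300, 400, 600, 700, 800, 900])  # decision variable
recidivism = np.array([52, 50, 47, 45, 30, 28, 25, 23])     # outcome, in percent
cutoff = 500                                                # eligibility threshold

eligible = hours >= cutoff
line_control = stats.linregress(hours[~eligible], recidivism[~eligible])
line_treated = stats.linregress(hours[eligible], recidivism[eligible])

gap = ((line_treated.intercept + line_treated.slope * cutoff) -
       (line_control.intercept + line_control.slope * cutoff))
print("estimated discontinuity at the cutoff:", round(gap, 1))

# A negative gap here would suggest the benefit lowered recidivism; a
# significance test for the gap is given in Cook and Campbell (1979).
```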

Inferences. If the assignment mechanism is known and measured, it becomes more plausible that intervention effects can be isolated. It is important that the point of demarcation be fixed between the intervention and comparison groups in regression discontinuity and displacement designs, because any mixing of the groups around the decision point makes inference more difficult. Note, however, that a purported advantage of these designs, their ability to target an intervention to those areas in greatest need, can be a disadvantage to the extent that the effects of the intervention are not generalizable to groups with different levels of need. Further, as previously noted, the confounding of the assignment variables and the intervention requires that other evidence be adduced to support the inference that it is the intervention that is inducing the displacement or discontinuity.

Another potential problem with these designs is that competing hypotheses for change cannot always be convincingly ruled out. In the Medicaid illustration described above, for example, potential sources of selection bias included, among other things: (1) the underreporting of income to be eligible for the program, (2) a disproportionate demand for medical care among the poorest group, and (3) the effect of a program requirement for medical consultations. Such potential sources of bias can be checked against archival records if the data are good enough; in this case, data on medical visits collected before and after Medicaid came into effect were able to rule out the hypothesis that the poorest group had a disproportionate demand for medical attention (see Riecken and Boruch, 1974).

Furthermore, the true functional form of the relationship between the decision variable and the outcome variable in these designs cannot be verified, as will be shown in a hypothetical plot in the next section. Nonadditive effects (interactions) of the selection variable and other unmeasured variables may, for example, underlie the observed bivariate association. Multiple studies undertaken prior to the introduction of the intervention may help, however, to verify the functional form of the relationship.

Notwithstanding these inferential problems, the panel believes that these designs are worth trying if multiple studies can be made. Regression displacement or regression discontinuity might be useful designs, for example, to evaluate prevention projects such as those conducted by community-based organizations. To put their interpretation on more secure footing, however, we advise that such studies be replicated elsewhere to provide some test of the robustness of their conclusions.

Existing Data Sources for Use in Quasi-Experimental Designs

The panel believes that quasi-experimental and nonexperimental designs may be useful in the event that randomized experiments are not feasible.

In this section, the panel discusses some existing data sources that might be used in quasi-experimental studies. Current efforts to collect data on HIV infection and on the public's knowledge, attitudes, and beliefs about AIDS provide observations that might be used to support interrupted time series or regression displacement designs.

The Neonatal Screening Survey. CDC/NIH's newborn screening survey may offer an opportunity for a local quasi-experimental evaluation of the effectiveness of a media or community health education project. The newborn survey, now being implemented in 44 states, the District of Columbia, and Puerto Rico, conducts blinded HIV antibody tests on heel stick blood specimens obtained from 50 to 100 percent of newborn infants (depending on the state). The resulting data provide evidence of the level of HIV infection among the population of women giving birth because a newborn infant carries maternal antibody to HIV whether or not the infant is infected. The great advantage of the newborn data is that they provide an unbiased estimate of HIV prevalence among all women who bear children in a particular time period. (They do not, however, reflect rates among women who abort their pregnancies or successfully practice contraception.)

If the newborn infant seroprevalence survey is expanded, as currently planned, to gather data on 100 percent of hospital births, the quantity of data should permit estimates of infection to be made for even relatively small communities and for different age groups, and separately for blacks, whites, and Hispanics within the community.32 These estimates may then allow relatively narrow geographic areas to be targeted for intervention.

It is reasonable to expect that expanded data from the survey will reflect variations in the overall patterns of HIV transmission that occur via heterosexual sex and drug use in a community. However, as noted in Chapter 2, HIV prevalence rates are less than ideal indicators in some respects. They do not provide, for example, rapid signals that protective changes in sexual or drug use behavior are occurring, and they provide little or no information on communities where the HIV virus is not well seeded in the population.33 Furthermore, these data will not reflect HIV transmission among men who have sex with men. Thus, although the newborn seroprevalence survey may provide important opportunities for quasi- and "natural" experiments, the range of its application will be somewhat narrow.

32. Sufficient precautions must be taken to prevent the inadvertent identification of individuals from the survey, e.g., by masking data in small non-zero cells.

33. In a closed population with a zero prevalence rate, prevalence will remain unchanged whether or not protective behavioral changes are adopted.

The panel also has some misgivings about using seroprevalence rates as the outcome measure of any AIDS prevention project because, among other things, a long interval of time is required to net enough occurrences of seropositivity to be able to test project effects (see Chapter 2).34 On the other hand, the panel believes that such rates may be useful in evaluations of community education projects aimed at reducing pregnancies among HIV-positive women (for example, health education projects delivered in community family planning clinics and through local, highly visible media campaigns). These data are more attractive because the interval between the intervention and the indicator of its effectiveness is shorter (nine months) and because the rates could reflect the desired outcome: a reduction in pregnancies. Analysis would, of course, require controls for age. So, for example, if the survey were to include mother's age to the nearest year, the prevalence of infection among this year's 18-year-old mothers could be compared with the prevalence among next year's 18-year-old mothers. The effectiveness of a campaign might manifest itself in a lower prevalence of infection in locales with high-intensity education projects than in locations without such projects and in a lower rate of births to women who were HIV seropositive.

It may be possible to use data from the CDC/NIH newborn survey to evaluate such an intervention with a regression displacement design, for example. A decision point for which communities receive the intervention can be established on the basis of the incidence of HIV infection found from the newborn survey. Following the intervention, seroprevalence rates can be examined to detect the effect of the intervention on the treated groups. The comparison groups in such a regression displacement design would be those locales that did not receive the intervention.

The following is a simple, hypothetical illustration of such an evaluation; the illustration does not take into account the possibly nonlinear growth rates of HIV infection. In this hypothetical example, the average heel stick seroprevalence rate for communities with populations of 500,000 or more is 3.0 per thousand (i.e., 3 babies with antibody to HIV per 1,000 births). An intensive community health education intervention is targeted to reduce pregnancies among women at high risk for HIV in communities where the heel stick rate is 3.0 to 3.9 per thousand. Communities A, B, and C have rates of 3.0, 3.4, and 3.8 per thousand, and they receive the intervention in August 1990.

34. Rates of HIV depend not only on behaviors but also on the amount of infection seeded within a population or locale and on other factors that can make incidence rates specious indicators of project effectiveness. It is unlikely that these rates can be adjusted to reflect initial conditions in different communities because there is a lack of reliable data on the prevalence and distribution of HIV in the U.S. population. (See discussion in Chapter 1 of the 1989 report of the parent committee [Turner, Miller, and Moses, 1989].)

156 ~ EVALUATING AIDS PREVENTION PROGRAMS 6 — 0 5 to to - 4 a, Q in a' ._ Ct (:5) ~ a, _ a' cn Q ._ - CD o Q cn H ,~ E ,~' D HER / . I . ~ B 0 1 2 3 4 ~ 6 Seropositive Pregnancies per 1,000, August 1 990 FIGURE ~2a Hypo~encal example of regression displacement analysis of die effects of a contraception campaign anned at women at high risk for HIV they receive the intervention In August 1990. Communities D-] have heel stick rates of I.0, I.6, 2.0, 2.5, 4.0, 4.4, and 5.3 per thousand, so they do not receive the intervention. Twelve months later, heel stick specimens are reexamined: the antibody rates for REV do not change In the unheated communities, but Me rates in communities A-C decline by 15 to 20 percent. As illustrated in Figure 6-2a, these rates are clearly below the regression line fitted for the communities Do. This example is clear and simple; however, effects are rarely so evident. Another example will illustrate the problem of determining the functional form of the relationship. In this example, we assume Hat rates In untreated communities will not be unchanged but will fluctuate from year to year. We also assume that the communities with the highest antibody rates will be targeted (communities H. I, and J). Figure 6-2b plots the rates of communities H-] In 1990 against post-intervention rates that decline by 15 to 20 percent. The 1990 rates of communities A-G are plotted against post-intervention rates that fluctuate, untreated, by O to (plus or I}iinus) 5 percent. Results are much more difficult to interpret because the data can be fitted not only linearly but also curvilinearly.

[Figure 6-2b appears here as a plot; only its labels are reproduced. Horizontal axis: seropositive pregnancies per 1,000, August 1990. Vertical axis (label garbled in the source): post-intervention seropositive pregnancies per 1,000. Communities A through J are plotted, and both a linear and a curvilinear fit appear plausible.]

FIGURE 6-2b Alternative pre/post-intervention plots suggesting the effects of a contraception campaign aimed at women at high risk for HIV.

Thus, we believe that a healthy amount of caution should be exercised in accepting the plausibility of the assumptions of regression displacement designs.

The National Health Interview Survey

A second, perhaps more promising, source of data for quasi-experimentation is the National Health Interview Survey conducted by the National Center for Health Statistics. As discussed in earlier chapters, this is a weekly probability sample of the adult population of the United States.

Since August 1987, the survey has provided cross-sectional data on the population's knowledge of HIV transmission, experience with HIV testing, and other matters related to AIDS. It is conceivable that new items might be added to this survey to measure other variables of interest to evaluators. Such additional items might then allow the use of multiple nonequivalent indicators, which would help improve the prospects of conducting convincing observational studies.

In Chapter 3, the panel recommended using the Health Interview Survey to evaluate aggregate trends in knowledge and attitudes about AIDS following exposure to phases of CDC's media campaign, particularly its public service announcements. The evaluation design proposed was a time series analysis that would monitor trends in desired outcomes over the course of the campaign period. In order to minimize the effect of the competing explanation for change, history, an extension of the measurement period before and after the broadcast could be combined with the randomized staggered implementation of the campaign, as discussed in Chapter 3. Even if the assignment cannot be randomized across markets, staggered implementation of the campaign would be of use in a time series design. It might also enable the use of a regression discontinuity design to model effectiveness, using scores on the Health Interview Survey as the decision criterion for selecting particular markets.35

The panel recognizes that staggered implementation of the media campaign would create extra demands on CDC's personnel and contract agency administrators because it would affect the approval process for a public service announcement campaign. Such implementation would require extending the period between approval and release of some phases of the campaign. In addition, it would require changes in the expectations held by the consortium set up by the National AIDS Information and Education Program to disseminate its public service announcements. (The current expectation is that they will receive new broadcast "spots" every six months.) However, the panel feels strongly that if the desire for evaluation is real, some changes in logistics must be initiated. We urge that policy makers within CDC accept the need to change the distribution schedule of public service announcements to enable the collection of data useful for a meaningful evaluation. Policy makers and staff will have to encourage patience and cooperation within their consortium of media outlets, to make these outlets understand why scheduling changes are desirable, and to make the changes acceptable.

35. The feasibility of this approach may be constrained by the sample size of roughly 50,000 households per year. Some markets might be represented by too few respondents to allow sufficiently precise estimates of the level of knowledge in that market.

The panel recommends that CDC initiate changes in its time schedules for the dissemination of public service announcements to facilitate the evaluation of the media campaign. To enable the staggered implementation of television broadcasts, changes are needed in (1) the distribution schedule of public service announcements within the National AIDS Information and Education Program's consortium of media distributors and (2) the period of time between Public Health Service approval and release of new phases of the campaign.

The panel also recognizes that adding questions to the National Health Interview Survey is not a simple matter. All such items must undergo a somewhat lengthy approval process by the National Center for Health Statistics and the Office of Management and Budget. Second, the panel understands that lengthy delays may occur in getting access to the data once they have been collected. The panel is concerned about both of these problems and urges greater cooperation on data sharing between the National Center for Health Statistics and other divisions of CDC.

The panel recommends that CDC initiate changes in its data collection and data sharing activities to facilitate the evaluation of the media campaign. To generate needed data, changes are needed in (1) the period of time for internal approval of data items for the National Health Interview Survey and (2) expeditious data sharing between the National Center for Health Statistics and other divisions of CDC.

Natural Experiments

As discussed earlier, one way to increase the number of situations in which a comparison group is feasible is to take advantage of natural experiments. A natural experiment occurs when membership in the treatment (versus comparison) group comes about for a fortuitous reason that makes it unlikely that selection biases could operate.36 The panel has mentioned several evaluations that have used time series analysis or regression displacement to estimate the effects of an intervention that has occurred exogenously.

36. Judgments as to what is truly "fortuitous" will be open to challenge. One can, however, imagine natural experiments that arise from situations that are indeed equivalent to a true randomized experiment. In the extreme case, one might imagine, for example, that a computer anomaly caused the IRS to mistakenly add $500 to the tax refunds of persons with Social Security numbers that ended in the digit 5. In such a fanciful event, one would have (with only some trivial assumptions) a random experiment with which to assess the impact of small, accidental windfalls upon the behavior of taxpayers.

A "natural laboratory" can also be established if one is willing to restrict the experiment to special subgroups. For example, the effects of alternative interventions were compared between natural comparison groups by Ziffer and Ziffer (1989). These investigators delivered alternative AIDS education courses to students enrolled in one-semester courses at the same college. They were not able to randomize who attended the different courses, but by restricting the intervention to a discrete pool of possible participants they obtained some plausible comparisons. (For example, investigators found that the course offering a "values and attitudes" component had a significantly greater effect on attitude change than the basic "facts" course.) Contaminants to this research design are clearly possible: for example, students choosing an early morning class might be different from those choosing to meet in the afternoon, or students enrolling in an AIDS course offered by the psychology department might be different from those enrolling in a biology department course. Still, design flaws such as these might be handled by keeping the alternative courses as parallel as possible in their contexts.

Identifying Natural Experiments

Finding a natural experiment or natural laboratory takes resourcefulness in identifying, defining, and recruiting the groups under analysis, as well as a bit of luck and patience. The panel believes that it is useful to be aware of and to search for situations in which natural experiments might arise. One such situation occurs when AIDS-related legislation is pending or has passed in a given state, creating a natural laboratory of persons affected by a "treatment" as well as a neighboring pool of people not so affected. For example, Florida requires that women convicted of prostitution undergo screening for a variety of sexually transmitted diseases, including HIV; women who are found to be infected must submit to treatment and counseling as a condition for release (Frosty and Ziegler, 1987). In this case, the relevant outcomes of testing and counseling might be compared between convicted, involuntarily "treated" women in Florida and their counterparts in states that do not mandate such intervention.37

37 In choosing this example, the panel does not wish to imply an endorsement of mandatory testing; rather, we support the idea of recognizing natural experiments when they occur.
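Where pre- and post-legislation measurements are available in both the affected and unaffected states, the comparison just described amounts to a difference-in-differences contrast. The expression below is a minimal sketch with hypothetical notation; it is not drawn from any actual Florida evaluation.

\[
\hat{\Delta} = \left(\bar{Y}^{\,\text{post}}_{\text{FL}} - \bar{Y}^{\,\text{pre}}_{\text{FL}}\right) - \left(\bar{Y}^{\,\text{post}}_{\text{other}} - \bar{Y}^{\,\text{pre}}_{\text{other}}\right)
\]

where \(\bar{Y}\) denotes the average outcome of interest (say, a measure of risk behavior) in each group and period. Subtracting the change observed in the comparison states removes influences, such as nationwide media attention to AIDS, that affect both groups alike; the estimate remains vulnerable to influences that affect only one group.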

Assumptions of Natural Experiments

Although the search for natural laboratories or experiments is an important way of reaping the benefits of a control group, pitfalls do occur. The investigator employing natural experiments, for example, typically has no control over selecting and implementing interventions and little control over measuring and managing controls. The unverifiable assumption is that the comparison group is identical to the treatment group except for the lack of the intervention, and except (perhaps) for statistical adjustments that might be made for differences in observable characteristics of the comparison and treatment groups.

Data Needs of Natural Experiments

It will not be possible to confirm the assumption that groups are identical except for the intervention. It would help, however, to at least partially corroborate the assumption by examining data on pre-intervention differences between two or more sites. The natural experiment of anonymous HIV testing in Oregon described earlier is a good illustration because investigators there used not only pre-post measures of the Oregon testing sites but also multi-site comparisons with neighboring states.

Matching Without Randomization

Matching is the third type of research design the panel considered for developing group comparability before data are collected for a study. Under some circumstances, matching is done retrospectively; moreover, it is sometimes done in conjunction with a type of statistical adjustment called analysis of covariance, which we will discuss in the following section on statistical adjustments and modeling. (Matching can also be used in a randomized experiment; if matching on factors known to have an important effect on outcome were possible, one of each pair might be randomly assigned to a project.)

Comprehensive knowledge about selection factors is crucial before matching is attempted. Although matching can control known sources of bias, other variables may influence the outcome. This approach assumes not only that the matching procedures are effective in eliminating the biasing effects of matched variables but also that these other confounding variables do not exist.

When participants in a singly constituted study group can be matched in pairs, and only one of the pair receives an intervention, some extraneous variables may be effectively controlled (e.g., individual motivation for inclusion in the study). For example, suppose members of a cohort

study are offered a support group meeting on Tuesday evenings to practice social skills associated with safer sex practices. Individuals who are unable to attend meetings on those evenings can be matched with those who do join the support group. (Note, however, that even in this example, the group that cannot attend on a certain night may still differ in important respects from the group that can attend.)

Prospective Nonrandomized Matching

In the prospective nonrandomized matched study, the group assigned to receive an intervention is matched with a comparison group that does not receive it. For example, it might be possible to identify four communities that are about to receive funding for CBO projects and then match them with four similar communities.38 In such designs both the treatment group and the comparison group are often given a pretest and a posttest. It is useful to recognize this design as a special case of the multiple interrupted time series design, where the series is constituted by only two points in time.

Another strategy for developing comparable treatment and comparison groups is the use of matched groups in an ongoing cohort design. In such a design, pairs of individuals may be identified as matched on independent variables that are assumed to be strongly correlated with the outcome of interest, such as baseline behaviors. The individuals might, for example, be matched on the level of the outcome variable at two or more points prior to the intervention. This option is attractive because the cost of data collection on an intervention's effects is minimal, since baseline data will already have been routinely collected. Other matches can be made based on factors predicted by the underlying behavioral theory to affect outcomes. Even in cases when matching does not provide a convincing control for contaminating factors, prospective cohort studies can help generate hypotheses about which interventions are effective. These hypotheses might later be tested in a quasi-experimental or experimental setting.

Retrospective Nonrandomized Matching

The more typical matched study involves nonrandomized retrospective matching, wherein an investigator attempts to find an untreated group that, in some important respects, "looks like" a group that has received an intervention.

38 This research design was used, for example, to evaluate the effects of the Youth Incentive Entitlement Pilot Projects. For a project aimed at reducing school dropout rates and providing work experience for teenagers, four communities were selected as pilot sites and were matched with four communities that did not receive the program (see Betsey, Hollister, and Papageorgiou, 1985).

The comparison group frequently is constituted from existing pools of individuals and institutions (e.g., areas served by different hospitals, separate cities or counties, and so forth). Sometimes a comparison group is constructed from the same cohort to which intervention members belong; the comparison group comprises the respondents who do not, for whatever reason, receive an intervention. Unlike prospective matches, retrospective matches do not provide the opportunity to do pretesting and multiple pre-intervention measurements that are individually tailored to the needs of the evaluation design. Such retrospective designs yield more convincing results when data are available to permit matching of respondents on baseline behaviors. The designs are also obviously more convincing when differences between the treatment group and the comparison group are minimal for variables that may be related to the outcome of interest. The designs remain vulnerable to distortion, however, by any unrecognized variables that affect outcomes and are differently distributed in the various groups.

The Multicenter AIDS Cohort Studies (MACS), which were designed to track the natural history of HIV disease among gay and bisexual men, have offered a setting in which the effects of AIDS interventions have been evaluated using matching strategies. One example of retrospective matching is provided by Fox and colleagues (1987), who had the advantage of having collected baseline data. These investigators compared how the knowledge of one's HIV antibody status affects one's sexual activities. Some of the participants in the Baltimore MACS cohort elected to learn their antibody status (the "aware" group) while others elected not to learn their status (the "unaware" group). In a post hoc analysis, investigators matched the aware and unaware men on number of male sex partners, as well as on age, race, perception of illness, manifestation of illness, and proportion who were antibody-positive. The groups were not matched, however, on two other theoretically confounding factors: education and depressive symptoms. Sexual activities were measured at three time points prior to disclosure of test results and at one six-month interval afterward.
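The mechanics of such post hoc matching can be illustrated with a short program. The sketch below pairs each "aware" participant with the closest "unaware" participant on standardized covariates, without replacement. The data and variable names are hypothetical, and the covariate list is far shorter than any real matching protocol would require.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical covariates (one row per participant): age, number of partners.
    aware = rng.normal([35.0, 4.0], [8.0, 3.0], size=(30, 2))
    unaware = rng.normal([33.0, 5.0], [8.0, 3.0], size=(60, 2))

    # Standardize with the pooled mean and standard deviation so that no
    # single covariate dominates the distance calculation.
    pooled = np.vstack([aware, unaware])
    mu, sd = pooled.mean(axis=0), pooled.std(axis=0)
    z_aware = (aware - mu) / sd
    z_unaware = (unaware - mu) / sd

    # Greedy nearest-neighbor matching without replacement: each aware
    # participant claims the closest still-available unaware participant.
    available = list(range(len(z_unaware)))
    pairs = []
    for i, row in enumerate(z_aware):
        dists = [np.linalg.norm(row - z_unaware[j]) for j in available]
        best = available.pop(int(np.argmin(dists)))
        pairs.append((i, best))

    print(f"formed {len(pairs)} matched pairs, e.g. {pairs[:3]}")

Greedy matching of this kind also illustrates a practical difficulty discussed below: as covariates are added, close matches become scarce, and the number of usable pairs shrinks.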

Assumptions. Three general assumptions are made when matching without randomization is used, and all three are liable to failure. The first is that matching takes into account all of the variables that are pertinent to selection into a project and to the project's outcome. This assumption is almost certain to fail because unobserved factors are not measured, whether they are individual differences (such as motivation to participate in a program) or community differences (such as decisions to adopt certain AIDS prevention policies). The second assumption is that the biases in the measurement of key variables (e.g., baseline and post-intervention sexual behavior) are equivalent between the treatment group and their matched comparison group. The third assumption, implied by the first two, is that the group constructed as a comparison is fully comparable to the group that receives the treatment.

A simple hypothetical example of the hazards of matching is displayed in Table 6-1. In the top half of this table, five communities hypothetically slated to receive funding for an education project are matched with five other communities on the basis of their population size. The matched communities are approximately the same size, their populations being within three percent of one another, based on July 1988 census figures. However, a comparison of annual AIDS rates for the matched communities for 1988 and 1989 shows that disparities exist. For example, Baltimore and Minneapolis-St. Paul both have populations of over 2.3 million, but Baltimore had 20.3 persons with AIDS per 100,000 in 1989, and Minneapolis had 6.6. It can hardly be assumed that local needs for AIDS interventions will be the same for the two communities; likewise, the citizens of these communities should not be assumed to feel similar inducements for changing their behaviors.

The same conclusion holds if locales are matched on AIDS rates. The bottom half of the table, for example, shows that Philadelphia and Orlando both had AIDS rates of 16.1 per 100,000 in 1989, but that Philadelphia has a population five times larger than Orlando's, not to mention a different demographic composition. It is thus not safe to assume that their HIV transmission sources are comparable. Although the panel realizes that few investigators would make matches as simplistic as these, they might attempt to match on both of these variables, and possibly other sociodemographic characteristics. Unfortunately, what the table cannot display is the dearth of community matches possible on even the two observed factors of population size and AIDS rates. Moreover, even the best matching attempts will miss unobserved influential variables. For example, communities slated for sponsorship will probably differ in important respects from communities that have not been chosen to receive funding (they may have more persuasive community leaders, have submitted more sophisticated requests for sponsorship, have less local AIDS funding, and so on).

In addition to the failure of general assumptions, a challenge to matching without randomization is likely whenever the selection of the intervention group is made on the basis that its members are in greater need of treatment. In this case, the treatment group's gain over the comparison group could be explained on the basis of "regression toward the mean." This phenomenon is an improvement that results merely from the group with extreme pretest scores gravitating back toward the population mean score upon retest.
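The expected size of this artifact follows from the test-retest correlation. Under the standard bivariate normal model with equal means and variances at pretest and retest, and with purely hypothetical values for illustration,

\[
E[Y_2 \mid Y_1 = \mu + k\sigma] = \mu + \rho k \sigma ,
\]

so that if a group is selected because its pretest scores lie two standard deviations from the mean (\(k = 2\)) and the test-retest correlation is \(\rho = 0.7\), the expected retest score lies only 1.4 standard deviations from the mean. The apparent change of 0.6 standard deviations would occur with no intervention at all.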

TABLE 6-1 Hypothetical Matchings of Ten Communities

6-1a. Comparisons of 1988 and 1989 AIDS Rates per 100,000 Between Communities Matched on Population Size

City         Population   1988 Rate   1989 Rate     City                    Population   1988 Rate   1989 Rate
Baltimore    2,342,500    13.7        20.3          Minneapolis-St. Paul    2,387,500    6.3         6.6
Seattle      1,861,700    14.4        19.6          Cleveland               1,845,000    7.0         6.9
Orlando      971,200      18.8        16.1          Louisville              967,000      4.6         4.5
Austin       748,500      13.4        24.5          Tulsa                   727,600      6.3         6.3
Las Vegas    631,300      14.4        20.2          Syracuse                650,300      4.6         5.9

6-1b. Population Comparisons of Ten Communities Matched on 1989 AIDS Rates per 100,000 Population

City                 1989 Rate   Population     City          1989 Rate   Population
Washington, D.C.     23.3        3,734,200      New Orleans   22.5        1,306,900
Philadelphia         16.1        4,920,400      Orlando       16.1        971,800
Phoenix              11.1        2,029,500      Wilmington    11.4        573,500
Albany/Schenectady   8.6         850,800        Baton Rouge   8.8         536,500
Salt Lake City       5.2         1,065,000      El Paso       5.6         585,900

SOURCE: Population statistics from U.S. Department of Commerce, Bureau of the Census, News, September 8, 1989; AIDS case data from U.S. Department of Health and Human Services, Centers for Disease Control, HIV/AIDS Surveillance Report, January 1990.

Data Needs. As evident from the example above, it is rarely possible to include measures of "all" the variables that could account for differences in outcomes observed between matched members of treatment and comparison groups. The panel repeats its belief that the state of the art in AIDS prevention research is underdeveloped with respect to predicting how people would behave in the absence of the program (e.g., changes in sexual behavior or needle use), and that it is therefore difficult to specify the types of variables that need to be considered (presumably prior risk behavior and some demographic characteristics are important, but even this is not known with certainty).

Inferences. Several serious problems emerge when inferences are made from matched studies.

First, it is not likely that all the relevant differences between participants and nonparticipants will be captured by the matching variables. Second, even if the pertinent variables were known, the method becomes difficult to implement if more than a few variables are to be matched, because fewer and fewer close pairs will be found. This practical difficulty may also necessitate reducing the desired sample size; obviously, inferences from reduced samples will be difficult to make. For example, when Chapin (1947) attempted to match on six variables, his original samples of 671 and 523 boys were reduced to samples of size 23 (reported in Cochran, 1965).39 Finally, the measurement of these matching variables is, itself, subject to considerable error and bias. (See Appendix C for an extended review of the reliability and validity of common measurements of sexual and drug using behaviors.) For all of these reasons, the panel believes that widespread use of matching nonequivalent comparison group designs would be premature in the evaluation of AIDS intervention programs.

Existing Data Sources for Matching Without Randomization

Notwithstanding the inferential problems involved in matching studies, the panel believes that longitudinal cohort studies from which participants in an intervention may be matched with nonparticipants can be rich and useful sources of information for generating hypotheses about project effects. When cohorts are sampled in multiple sites, data collection can be enhanced by coordinating instrumentation across the locations, facilitating cross-site analyses.

Cohorts of Gay Men. Several cohorts of gay men are being studied longitudinally and may serve as sources of matched pairs. For example, the National Institute of Allergy and Infectious Diseases supports research on cohorts of men who have sex with men. In the early stages of the epidemic, several MACS cohorts (described earlier) were put together to investigate the epidemiology of AIDS and to draw inferences about the risk factors for acquiring HIV. By administering repeated physical examinations and interviews, the MACS studies have shed light on the natural history of HIV infection and AIDS and the occurrence of behavioral changes made to avoid infection (i.e., reduction of unprotected anal intercourse and number of sexual partners).40 Although the MACS were not designed to evaluate interventions, they are appealing examples

39 Parametric models might be used when samples are insufficient to obtain exact matches on all variables. The use of such models, however, will introduce an additional degree of uncertainty into the analysis.

40 Less is known about men in smaller cities where the risk of HIV infection is lower, but it should not be assumed that their behaviors are the same. For example, Kelly and colleagues (1990) surveyed patrons of men's gay bars in three small southern cities (Monroe, Louisiana; Hattiesburg, Mississippi; and Biloxi, Mississippi) and found higher rates of risky behavior than those reported in large urban epicenters of the disease. (Note, however, that the source of recruitment, gay bars, may account for some of the differential rates between these cities and the epicenters.)

because the studies have been coordinated among four sites (Baltimore, Chicago, Los Angeles, and Pittsburgh) to administer the same survey instruments to measure sexual and drug use behaviors and to recruit men who were similar in certain important respects (gay and bisexual men, excluding those diagnosed with AIDS). The generalizability of MACS data is limited, however, because recruitment methods generated a sample of predominately white, middle class, urban men who made a long-term commitment to a research project on men who have same-gender sex.

Several other groups of gay men may serve as additional sources. A cohort similar to the MACS groups is being followed in the San Francisco Men's Health Study (see, e.g., Winkelstein et al., 1987). This study was designed at the same time as the MACS cohorts, but investigators decided during the planning stages to work independently. In addition, CDC sponsors demonstration and education projects among gay and bisexual men in six sites: Albany and New York City; Chicago; Dallas; Denver; Long Beach, California; and Seattle-King County, Washington. CDC also supports, along with the Massachusetts Department of Public Health, a longitudinal study of homosexual men drawn from a Boston community health center (see, for example, McCusker et al., 1988). Additionally, the National Institute of Mental Health is currently supporting longitudinal studies of behavioral change among gay and bisexual men in Chicago (e.g., Joseph et al., 1987), New York City (e.g., Martin and Dean, 1989), and San Francisco (e.g., McKusick et al., 1985).

Cohorts of Intravenous Drug Users. Two cohort studies of drug users provide a potential source of subjects for matching studies within this population. The Treatment Outcome Prospective Study (TOPS) is a longitudinal study of drug users who receive treatment from publicly funded programs. Sponsored by the National Institute on Drug Abuse, the TOPS study seeks to understand the natural history of drug users before, during, and after treatment (see, e.g., Hubbard et al., 1988). The ALIVE group in Baltimore is a cohort of drug users who are being followed to learn more about the natural history of HIV infection in this population. This group of active drug users without AIDS was recruited from street outreach, clinics and hospitals, and drug treatment programs (Nelson et al., 1989). Cohort studies such as these can provide important insights into the factors that lead to behavioral changes and that cause programs to be more (and less) effective. Because of the restricted nature of the samples used in these studies, generalization of findings from such studies will usually require confirmatory studies using other populations.

In the next section, we examine statistical adjustment and modeling strategies. These strategies constitute the third major approach to nonexperimental evaluation discussed in this chapter.

MODELING AND STATISTICAL ADJUSTMENTS FOR BIAS

We have already looked at two broad answers to the problem of isolating treatment effects: randomization and the deliberate design of nonequivalent comparison groups. Another approach is to search for (or sometimes establish) a comparison group that looks somewhat like the group who received the intervention and then identify, measure, and take into account the many variables that differ between them and that are believed to affect outcome (i.e., "model" the selection bias). Success in this undertaking calls for carrying through three tasks. These tasks are to:

1. Recognize the variables that may influence outcomes,
2. Measure these variables in all participants, and
3. Use the measurements in a way that correctly adjusts outcomes for group differences.

In this section the panel looks at three types of modeling used to eliminate selection bias and to create comparable groups: analysis of covariance, structural equation models, and selection models.

Analysis of Covariance

Analysis of covariance is sometimes used on data from nonrandomized (quasi-experimental) studies to adjust outcome measurements for preexisting differences between groups. In this method, the average outcome in a treatment group comprises two components (plus random error): the effect of the treatment applied in that group, and the effects of relevant confounding variables (for example, age, sexual history, drug use history, and so on). The effects of confounding variables are estimated using some postulated model, perhaps a linear regression, and each treatment group average is thus adjusted. The difference in adjusted average outcomes is then taken as the estimate of the difference in treatment outcomes.
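In the simplest linear version, the postulated model can be written as follows; the notation is generic rather than specific to any AIDS data set:

\[
Y_i = \alpha + \tau T_i + \beta X_i + \varepsilon_i ,
\]

where \(Y_i\) is the outcome for individual \(i\), \(T_i\) indicates membership in the treatment group, and \(X_i\) is the covariate (or vector of covariates). The adjusted treatment effect is the estimate of \(\tau\): the difference between group means after each has been moved along the fitted regression line to a common covariate value. The estimate of \(\tau\) is trustworthy only if the model's form is right and \(X_i\) captures every confounding influence; omitting a relevant covariate, or misspecifying the slope, can distort \(\hat{\tau}\) in either direction.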

This approach can be very successful in some tightly controlled settings, such as laboratory experiments, where all factors believed to influence the desired outcomes may be measured and included in the model. The success of the method is often equivocal, however, in complex social science investigations because it formally requires identification, measurement, and use of all important confounding variables.41 Unfortunately, it is not generally true that correct adjustment for some of the intervening variables will be an improvement over the unadjusted difference. Indeed, one can make projects look far worse or far better than they are by using this approach.

For decades, analysis of covariance has received wide use in behavioral and social investigations. This use is statistically well justified in randomized field experiments. In nonexperiments, however, its use has frequently turned up riddles and complications (Boruch and Riecken, 1975; Boruch, 1986).

Assumptions

Covariance adjustment is dependent on the assumptions of its model, e.g., the assumption of a linear, additive (or other) functional relationship between the covarying independent variables and the outcome measure; this relationship is presumed to be the same in all groups. Moreover, the model assumes that the error term is independent of the treatment and the covariates. Finally, it assumes that no unspecified factors exist that affect both selection and outcome.

Any observational method that attempts to control for selection bias must rely on assumptions that cannot be verified (or can only be imperfectly verified) about the factors that affect the likelihood that individuals will learn of, enroll in, and participate in a project. Examples of several failures of these assumptions and the long debates about appropriate models and inferences can be found in the various evaluations that were performed on the Head Start program. Because Head Start involved several curricular models and three different cohorts of children, various data sets have been analyzed and their results critiqued. The original Head Start evaluation used comparison groups that matched the experimental sample on age, sex, race, and kindergarten attendance. This cohort was not pretested. In the program evaluation, investigators constructed an index of socioeconomic status as the single covariate, and the conclusion from the analysis of covariance was that the program was largely ineffective. This conclusion was roundly criticized because it assumed, incorrectly, that its covariate adequately measured the confounding factors that affected selection and outcome.

41 That is, variables that (1) influence outcomes, and (2) are not equivalently distributed in the treatment and comparison groups.

Subsequent analyses illustrated the danger of assuming that selection factors have been adequately controlled by covariate adjustment. For example, Barnow (1973) reanalyzed the original Head Start data by different racial/ethnic groups and found the program to have positive effects on black and Mexican American children and negative effects on white children, a quite different finding than that of the original study. In addition, Bryk and Weisberg (1976) analyzed the evaluation of another Head Start cohort that adjusted for, among other things, scores on a Preschool Inventory test. The authors criticized the test as being unreliable, and they argued that the covariance model underadjusted for pretest differences between groups and that it overestimated program effects. Later, Magidson (1977) reported small positive effects of the program on white children when he reanalyzed Barnow's data using a model that incorporated allowances for measurement error in the tests and postulated correlations between disturbance terms in a structural equation model (an evaluation method that will be discussed below). As such 10-year controversies indicate, the process of inference from such nonrandomized designs can be contentious, and closure may be difficult to achieve. This fact was well anticipated by Lord (1967), who observed that "no logical or statistical procedure can be counted on to make proper allowances for uncontrolled pre-existing differences between groups."

Data Needs

Because the selection process influences the ultimate effectiveness of a project, the efforts made by prevention projects to attract and retain participants are of interest. The ability to attract and retain individuals is itself composed of at least two elements: (1) the motivation of the individuals and (2) the ease of access to the program, including its visibility, its convenience in terms of location and hours of operation, and so on. Data on these factors can be collected during the process evaluations, along with other data on program characteristics and program implementation discussed in Chapters 3-5;42 however, the panel would note that data on motivations are likely to be difficult to collect and fraught with problems (see, for example, Turner and Martin, 1985: Vol. 1, Ch. 5, 7-9).

42 Data on factors that attract and retain participants can also provide information about important communication links with the population at risk. Such data permit the identification of subgroups that were better attracted to the project, those that were missed, and those who benefited most. This, in turn, may help in informing other projects about how to enlist participants, and about which characteristics new projects should avoid or adopt.

Inferences

The success of the analysis of covariance procedure depends on how well its users understand the relationship between underlying variables and the outcomes of interest. For example, if the exact nature of the relationship between age and outcome were known, statistical procedures might be used to model and adjust the estimates of effect to account for differences in the age composition of the treatment and comparison groups. In the evaluation settings we contemplate in AIDS prevention, the state of the art unfortunately does not include knowledge of the important factors determining outcome, let alone measured values for them. It is thus tempting in this situation to measure instead a large collection of all possible variables thought to relate to outcome. If this is done, a decision must be made as to which factors to include in the model and how they should be included. Without strong theory to guide this selection, equivocal answers may result; the approach of measuring as many things as possible and adjusting as best as possible cannot be relied upon to set things right.43

Structural Equation and Selection Models

Structural Equation Models

Simply put, a structural equation model is a statistical equation or system of equations that represents a causal linkage among two or more variables.44 In the context of an evaluation study, this procedure may use complex models of behavior to explain the effects of an intervention on a desired outcome, as mediated through a series of intervening variables and as co-influenced by factors that are exogenous to the intervention. Structural equation models representing these processes may be expressed by path diagrams representing the patterns of influence among these variables. In such diagrams the relationships specified by the equations are represented in a network of causal linkages.

43 The reader may wish, for example, to review the role played by selection and compliance factors in the Clofibrate experience discussed in the section on "Compromised Randomization."

44 A variety of texts on the topic are available. For example, Goldberger and Duncan (1973) was a seminal effort that brought together theoretical work by econometricians, psychologists, and sociologists. Duncan's (1975) monograph introduced the topic at a simple statistical and logical level. Applications of modeling in the behavioral sciences and education were reviewed by Bentler (1980). In addition, Dwyer (1983) has written a comprehensive technical text on structural equation modeling, which includes discussions of the issues of causal inferences and potential uses of modeling.
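A small mediation system illustrates the form such models take. The two equations below are a hypothetical sketch, not a fitted model: an intervention \(T\) is posited to change an intervening variable \(M\) (say, perceived risk), which in turn changes the behavioral outcome \(Y\), with an exogenous covariate \(X\) influencing both.

\[
M_i = \alpha_1 + \gamma T_i + \delta_1 X_i + u_i
\]
\[
Y_i = \alpha_2 + \beta M_i + \tau T_i + \delta_2 X_i + v_i
\]

In the corresponding path diagram, the intervention's total effect on behavior is the direct path \(\tau\) plus the mediated path \(\gamma\beta\). Estimating such a system sensibly requires the theory to say which paths exist and which are absent; every omitted arrow is a substantive assumption.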

A structural equation model embodies (in a series of equations) an explicit theory about ways in which one variable in a model may (and may not) affect the other variables in the model. Plainly, reliance upon such systems of equations requires well-articulated theory or prior empirical knowledge, in the present instance, of the factors that influence risk-taking behavior and project participation. In particular, this procedure presumes the existence of: (1) explicit theory or empirical evidence about causal relationships among variables and the functional form of these relationships, (2) large samples, especially when the theory is complex, and (3) measurements of the variables specified in the theory.

Where there are sufficient theory and data to use this technique, structural modeling has important advantages. In particular, the structural modeling approach:

· forces the analyst to be explicit in articulating theory;
· facilitates an analytic focus upon how well a particular system of structural equations (a theory) fits the data;
· facilitates comparison of competing models or theories;
· provides an intuitively appealing way of representing complex effects, including direct and mediated chains of causal influence; and
· when successful, provides some basis for predicting outcomes that may occur as conditions change (e.g., via a policy change that alters the distribution of one or more model variables).

It is nonetheless this panel's judgment that it is unlikely in the near term that structural equation modeling of nonrandomized studies could provide a firmer basis for evaluating the effectiveness of AIDS prevention programs than well-executed randomized experiments. This judgment follows from three considerations:

· theory about causal relations is weak in AIDS settings;
· empirical data that would substitute for theory are sparse; and
· the estimates produced by structural models will be subject to considerable uncertainties because competing theories are likely to be plentiful.

For these reasons the panel believes that structural equation modeling would not permit one to make causal inferences (i.e., to declare that a project works) or to measure the magnitude of effects (how well it works) with nearly the same level of confidence that a well-executed experiment does.

Although the panel is not optimistic about our present ability to use structural equation models and data from nonrandomized studies as the primary strategy for evaluating the effectiveness of AIDS prevention programs,

we do believe that such models will have a role to play, and we suspect that this role may grow in the future. In particular, the panel believes that much might be gained by the judicious use of such models as an adjunct to randomized experiments. Modeling efforts might be used, for example, to improve our understanding of the individual and contextual factors that mediate between a treatment and an outcome. Furthermore, as experience accrues in situations where modeling is done in tandem with experiments, we anticipate the development of theory and data that may allow modeling approaches to substitute for some experiments in the future.

Selection Models

Another approach to nonexperimental program evaluation comes under the heading of selection modeling. Here the problem of nonequivalent comparison groups is addressed by focusing explicitly on the determinants of project participation. An analysis is conducted of the reasons why some individuals participate and others do not, in the hope of locating observable variables that can be used to control for, and eliminate, the unmeasured differences between participants and nonparticipants. Such analyses require credible assumptions and sufficient data to explore them. Because this procedure may be new to non-economists, the panel has included a background paper that details selection modeling procedures (Appendix F). A shorter exposition is presented below.

Investigators using selection modeling procedures have generally taken one of two approaches (see, e.g., Heckman and Robb, 1985a and 1985b). The first involves an implicit search for natural experiments (which are expected to present an identifying variable, or set of variables, that theoretically affects project participation but does not directly affect the outcome variable). The second approach involves controlling for the determinants of project participation through the use of longitudinal, cohort, or retrospective data sets containing information on the histories of project participants and nonparticipants.

Selection Models and Natural Experiments. The simplest selection model uses one variable as a proxy for unobservables. Let us take as an example a natural experiment in which neighborhood A and neighborhood B have equal rates of AIDS incidence and populations with similar demographics and sexual behavior histories. Neighborhood A, however, has more counseling and testing participant slots, purely for political reasons; perhaps the congressional representative from A enjoys seniority over B's representative. Such an identifying variable would explain

why participation in neighborhood A is greater, and it would be unrelated to unobservable differences in the preproject levels of the outcome variable.

Once an identifying variable or set of such variables is found,45 there are several statistical methods that may be used to obtain treatment estimates.46 The best-known technique is a two-step method called the "Heckman lambda" method in which: (1) an equation is estimated for the determinants of project participation, and (2) from the results of the first step, a "selectivity bias" variable is constructed that is intended to control for the mean levels of the unmeasured differences between participants and nonparticipants. Selection modelers use this second-stage analysis to estimate the effect of program participation on the outcome variable.

Selection Modeling and Historical Controls. The second approach of selection modelers involves identifying determinants of project participation from available data on the histories of project participants and nonparticipants. The philosophy is that the individual histories, taken as a whole, will serve as a proxy for unobservable variables that account for differences in outcomes between groups. The simplest example is one in which unmeasured differences between participants and nonparticipants are explained by preproject measures of the outcome variable. For example, if the difference between those who do and do not participate in a counseling and testing project arises solely because the participants practiced risk reduction behaviors more frequently at some point prior to their entry into a project, having information on their behavior at that point and controlling for it statistically might eliminate the selection bias. This may not always be possible, of course: those who participate may have decided to start practicing less risky behavior between the time their behavior was measured and the time they entered the project. More generally, if information is available for many different points in the past, a fairly complete history of sexual or drug use behavior can be controlled for. Selection bias will then remain only if participants have the same contemporaneous characteristics (age, location, education, and so on) and the same sexual or drug injection history as nonparticipants, but participants are nevertheless different in some unmeasured way that is not independent of the outcome variable (e.g., their intention to start practicing less risky behavior in the future).

45 Identifying variables can be area characteristics, as in this example, or they can be individual characteristics; they may be discrete variables (city or neighborhood location) or continuous variables (dollar level of funding); or some combination of all of these.

46 These methods have been explored in depth in the selection modeling literature (e.g., Barnow, Cain, and Goldberger, 1980; Maddala, 1983; Heckman and Robb, 1985a, 1985b).
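The two-step logic can be written compactly. The equations below are a standard textbook sketch of the procedure, with generic notation rather than variables from any actual AIDS data set. Participation is governed by a latent index, and the first-step probit yields the inverse Mills ratio \(\lambda\) that enters the outcome equation as the "selectivity bias" control:

\[
T_i^* = \gamma Z_i + u_i , \qquad T_i = 1 \text{ if } T_i^* > 0
\]
\[
\hat{\lambda}_i = \frac{\phi(\hat{\gamma} Z_i)}{\Phi(\hat{\gamma} Z_i)} \text{ for participants, } \quad \hat{\lambda}_i = \frac{-\phi(\hat{\gamma} Z_i)}{1 - \Phi(\hat{\gamma} Z_i)} \text{ for nonparticipants}
\]
\[
Y_i = \alpha + \tau T_i + \beta X_i + \theta \hat{\lambda}_i + \varepsilon_i
\]

Here \(Z_i\) contains the identifying variable (in the example above, the neighborhood's politically determined allotment of slots), \(\phi\) and \(\Phi\) are the standard normal density and distribution functions, and \(\tau\) is the estimated program effect. The method's credibility rests on the exclusion restriction that \(Z_i\) influences participation but not the outcome directly, and on the assumed joint normality of \(u_i\) and \(\varepsilon_i\).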

Assumptions of Modeling

One great merit of both structural equation and selection modeling techniques is that the assumptions needed for these models are made explicit (even though they may be unverifiable). If models are used in an evaluation, their assumptions should be fully reported, so that readers of the research can understand and weigh their plausibility. Moreover, whenever possible, evaluators who use such models need to provide an assessment of whether the study's assumptions were actually met; this assessment can be made with the help of subject-matter experts in HIV disease prevention.

The potential use of selection modeling in evaluation has generated considerable controversy between economists and statisticians. Examples of that controversy may be found in papers and commentary presented at a 1985 symposium (Heckman and Robb, 1986a, 1986b; Hartigan, 1986; Tukey, 1986a, 1986b), in an exchange recently published by the Journal of the American Statistical Association (Heckman and Hotz, 1989a, 1989b; Holland, 1989; Moffitt, 1989), and in the remarks made by participants at the conference held by this panel in January 1990. At present there are divergent and strongly held opinions about the potential uses and misuses of these procedures, and no practical experience applying these procedures to the task of evaluating AIDS prevention programs. In the following pages, the panel briefly reviews some of the key issues that have emerged in the debate over selection models. Given the present state of scientific debate concerning the applicability of these models, it is the panel's belief that it would be unwise to rely upon selection modeling as the mainstay of an AIDS evaluation strategy. Rather, the panel believes that careful efforts should be made to learn more about the applicability of these procedures to the task of AIDS evaluation. Such experience may dictate wider use of these procedures in the future.

A critical question in the selection modeling approach is whether selection bias can be eliminated by either of the two methods described above, and whether the analyst can determine when selection bias has been eliminated and when it has not. Critics of this approach have observed that evaluations using selection modeling procedures have often been unsuccessful in replicating the results of randomized controlled experiments (see LaLonde, 1986; Boruch, 1986; and Fraker and Maynard, 1986). Because of the sensitivity of modeling approaches to the assumptions that have to be made, critics claim that selection modeling is unreliable and that it creates substantial uncertainty as to true effects. Some analysts (Fraker and Maynard, 1986; LaLonde, 1986) argue more flatly that the approach does not eliminate selection bias and that the requisite determination cannot be made. Both these analysts find that estimates of effects are extremely sensitive to the particular "identifying" variables chosen (in the first method) and to how much preproject history is controlled for (in the second method). LaLonde, for example, makes side-by-side comparisons of the effects of a manpower training program estimated from a randomized experiment with estimates derived from several nonexperimental models. Table 6-2 reproduces selected estimates from five of these models in order to compare the results of the randomized control group with estimates from two possible comparison groups.

TABLE 6-2 Five Models of Estimated Manpower Training Effects on 1979 Earnings

                                          Estimate of Effects   Estimated Effects Using Nonexperimental
                                          from Experiment       Comparison Groups
MODEL                                     Est. (s.e.)           PSID-1 (s.e.)       CPS-SSA-1 (s.e.)

Treatment earnings less comparison earnings
  Men                                     886 (476)             -15,578 (913)       -8,870 (562)
  Women                                   851 (307)             -3,357 (403)        -3,363 (320)

One-stage econometric model (controlling for pre-training earnings and all observed variables, including women's AFDC status in 1975)
  Men                                                           (896)               -805 (484)
  Women                                                         2,097 (491)         1,041 (503)

Two-stage econometric models (variables not in the earnings equation but included in the participation equation):

Marital status, residency in an SMSA, 1976 employment status, number of children, women's AFDC status in 1975
  Men                                                           -1,133 (820)        -22 (584)
  Women                                                         1,129 (385)         1,102 (323)

1976 employment status, number of children
  Men                                                           -1,161 (864)        -67 (905)
  Women                                                         1,564 (604)         213 (588)

No exclusion restrictions
  Men                                                                               1,747 (620)
  Women                                                                             805 (523)

NOTES: s.e. is standard error of estimate. The estimated training effects are in 1982 dollars. PSID-1 comparison group members were household heads in the Panel Study of Income Dynamics poverty subsample, continuously from 1975 through 1978, who were less than 55 years old and did not classify themselves as retired in 1975. CPS-SSA-1 comparisons were individuals from the Current Population Survey/Social Security Administration matched file, who were in the labor force in March 1976, with nominal income less than $20,000 and household income less than $30,000 (women between 20 and 55 years of age and men less than or equal to 55 years of age).

SOURCE: LaLonde, 1986: Tables 4, 5, and 6. Because LaLonde shows two-stage estimates for only two candidate comparison groups, these are the comparisons reported here. (LaLonde shows eight female candidate comparison groups in Table 4 and six male candidate comparison groups in Table 5.)

It will be seen from Table 6-2 that experimentally measured effects are similar for men and women at around $870, as seen in the top two lines of the table. The corresponding estimates using nonexperimental control groups are found in the same top two lines and are greatly different: for men the +$886 figure would be replaced by -$15,578 or -$8,870, depending on which comparison group is chosen. For women the observed experimental difference of +$851 would be replaced by -$3,357 or -$3,363. Later lines in the table offer other model-based estimates of the gains for men and for women. One cannot help but be struck by their poor agreement with the experimentally found effects.

On the other side of this argument, Heckman and Hotz (1989a) have argued that many of the different selection models estimated by Fraker and Maynard and by LaLonde can, in fact, be tested and rejected as invalid in the sense that they fit the data poorly. Heckman and Hotz claim that a set of "best" models can be selected by standard statistical methods of hypothesis testing. These authors use nonexperimental data and methods of model selection to argue that the effect estimate obtained in a particular randomized trial can be reproduced with a "best" selection model using nonexperimental data (e.g., the random-growth estimates of effects for high school dropouts and the linear control function estimates of effects for AFDC recipients in Table 6-3). It will be seen from Table 6-3 that the unrejected linear control function estimate for AFDC recipients closely approximates the estimate derived in a randomized trial (+$267 vs. +$238, with standard errors of 162 and 152, respectively). The random-growth estimates for high school dropouts, however, were also unrejected, and they would replace an experimentally estimated effect of +$9 with negative effect estimates of -$154 and -$724, respectively. For the latter estimates the inference of effect is also complicated by the larger standard errors that were obtained from the nonexperimental analysis (212 and 502, versus 173 for the experiment).

TABLE 6-3 Experimental and Nonexperimental Estimates of Training Effects for School Dropouts and AFDC Recipients, 1979. (Nonexperimental estimates that were not rejected by Heckman and Hotz's statistical tests are noted in the row labels.)

                                                    School Dropouts     AFDC Recipients
                                                    Est. (s.e.)         Est. (s.e.)

Estimates from experiment                           9 (173)             238 (152)

Nonexperimental estimates
1: Linear control function estimates*
   Variant 1                                        -1884 (247)         508 (193)
   Variant 2                                        -1827 (246)         544 (195)
   Weighted average (AFDC model not rejected)                           267 (162)
2: Fixed-effect estimates constructed with 1972 pretraining earnings*
   Variant 1                                        -2172 (277)         522 (179)
   Variant 2                                        -2070 (275)         500 (184)
3: Fixed-effect estimates constructed with 1974 pretraining earnings*
   Variant 1                                        -1663 (301)         -217 (546)
   Variant 2                                        -1636 (269)         -263 (557)
4: Random-growth estimates constructed with 1973 pretraining earnings*
   Variant 1
   Variant 2
   Weighted average (dropout model not rejected)    -154 (212)          1109 (576)
5: Random-growth estimates constructed with 1974 pretraining earnings*
   Variant 1
   Variant 2
   Weighted average (dropout model not rejected)    -724 (502)          860 (576)

NOTES: Est. is estimate of effect; s.e. is standard error of estimate. Variant 1 assumes that changes in the outcome variable will be the same for nonparticipants as for participants in the absence of treatment. Variant 2 assumes that changes in the outcome variable will vary for people with different characteristics; the average of sample means is shown.

SOURCE: Heckman and Hotz, 1990: Tables 3 and 4. (Heckman and Hotz show additional rejected estimates based on different controlling variables, which the reader may be interested in examining.)

*Controlled for race, sex, marital status, age, education, residency in an SMSA, and participation in NSW (dropouts) or Current Population Survey (AFDC recipients aged 18-64, with dependent children aged 16 and under).

Data Needs of Models

Modelers attempt to control for selection bias by constructing plausible models of selection and then considering the types of data that might be available for fitting a model. In each case, investigators show the assumptions that need to be met in order to obtain consistent estimates of the effects of an intervention. Perhaps modelers' major conclusion is that the assumptions necessary to obtain consistent estimates grow less restrictive the richer the data that are available. Multiple comparison groups, multiple independent variables, multiple outcomes, and multiple time points all reduce the number of unverifiable assumptions needed to fit a model. The price paid for this reduction in restrictive assumptions, however, is an increase in error variance, as discussed below.

The minimal data set needed to construct structural and selection models is a single cross-section of post-treatment information (which might be obtained, e.g., from a compromised experiment). The assumptions involved with cross-sectional data have already been described for structural equation models. In addition, use of data from a single cross-section to estimate a selection model requires that the investigator find a plausible identifying variable (that is, a variable that influences participation in the program but not the outcome). If longitudinal data are available from a random sample from periods both prior to and after the treatment, weaker assumptions can be made. For some of the estimation methods, it is more important to have multiple periods of pre-project data than post-project data because the former permit a more adequate control for individual histories prior to the intervention.

To some extent, selection models confront the same inferential problems as natural experiments. How can it be known with certainty that, for example, neighborhood location is indeed independent of the unmeasured differences in the outcome variable between participants and nonparticipants? As discussed in reference to natural experiments, the plausibility of the assumption cannot be tested without gathering additional information. Such information may take the form of institutional knowledge of how the Public Health Service allocates its limited funds or information about the details by which different counseling and testing projects are funded in different neighborhoods and different cities. Alternatively, it may take the form of a search for yet additional sources of natural experimental variation in an attempt to determine whether both neighborhood variation and some other type of natural variation give the same, or similar, estimates of effects.

Similarly, one may ask how it can be known with confidence that a single pre-project data point is sufficient to eliminate selection bias. In this case, the collection of additional data, namely data from points farther in the past, might be used to make the determination. If the single pre-project data point is sufficient, then controlling for additional pre-project data points will not affect the estimate of effects.
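This stability check can be written out directly. The two regressions below are a schematic sketch in generic notation, not drawn from any actual evaluation: the post-project outcome is regressed on participation, first controlling for one pre-project measurement and then for two.

\[
Y_{i,\text{post}} = \alpha_1 + \tau_1 T_i + \beta_1 Y_{i,-1} + \varepsilon_{1i}
\]
\[
Y_{i,\text{post}} = \alpha_2 + \tau_2 T_i + \beta_2 Y_{i,-1} + \beta_3 Y_{i,-2} + \varepsilon_{2i}
\]

If \(\hat{\tau}_1\) and \(\hat{\tau}_2\) are essentially the same, the single pre-project point appears adequate; if they diverge, the selection process is not fully captured by one measurement. Agreement is supportive but not conclusive, since both estimates could share the same undetected bias.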

One possible consequence of using historical controls, as proposed in Appendix F, is an increase in statistical uncertainty. As in many areas of statistics, a tradeoff must be made between the potential reduction in bias and the potential increase in variance associated with a method that relies on weaker assumptions. The measures proposed for historical controls in selection modeling rely on multiple observations of the outcome variable at different points in time. Estimates based on first, second, and higher order differences involve sums and differences of additional variables, and these estimates can, as a result, have higher variances than simple cross-sectional differences. The actual variance will depend on the structure of the correlations between successive observations. If the observations are sufficiently correlated (which, of course, is the premise of the approach), the variance of estimates based on historical controls can be lower than those based on cross-sectional data. If, on the other hand, there is no correlation between successive measurements, an estimate based on first differences would have twice the variance of one based on contemporary controls, and one based on third differences would have twenty times the variance. One cannot know, a priori, which situation will apply, but "noisy data" (in which the measurement variability is high relative to real changes in the underlying variable) would argue against incorporating multiple observations.47

47 After the analysis has been completed, of course, it may be possible to estimate the variances and explicitly consider the tradeoffs between bias and variance.
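The two multipliers just cited follow from simple arithmetic on the weights that define a difference. Assume, purely for illustration, that successive measurements are uncorrelated, each with variance \(\sigma^2\). A first difference has weights \((1, -1)\) and a third difference has weights \((1, -3, 3, -1)\); the variance of a weighted sum of uncorrelated variables is \(\sigma^2\) times the sum of the squared weights:

\[
\mathrm{Var}(Y_2 - Y_1) = (1 + 1)\,\sigma^2 = 2\sigma^2
\]
\[
\mathrm{Var}(Y_4 - 3Y_3 + 3Y_2 - Y_1) = (1 + 9 + 9 + 1)\,\sigma^2 = 20\sigma^2
\]

The same multipliers of two and twenty carry over when the treatment and comparison series are differenced against each other, and positive correlation between successive measurements shrinks them, which is why strongly correlated histories are the premise of the approach.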

Inferences from Modeling

The panel's concern about establishing comparability using structural equation or selection models is that they may fail (without detection) because of either flaws in data quality48 or errors in the assumptions that are made about the relationship of variables that affect selection and outcome. First, at the risk of being redundant, the panel wishes to emphasize that our understanding of the behavioral and other characteristics that are important in changing behaviors that transmit HIV is woefully inadequate. The absence of tested theory or empirical evidence shrinks the basis for the assumptions required by these models.

Second, there are also problems that stem from the quality of the available data. As reported in Miller, Turner, and Moses (1990: Ch. 6; reprinted as Appendix C), extant surveys of sexual and drug using practices are riddled with bias. Although the authors are sanguine about the feasibility and replicability of such survey measurement, they note that convincing evidence of measurement validity is sadly lacking for AIDS risk behaviors.

In addition to data problems, the wide variety of estimates possible from both structural and selection models also dampens our enthusiasm. As some critics have pointed out, the model-selection tests are designed to determine if overidentified data fit a model; if the data fit, the test does not reject the model. Moreover, it has been argued that a model that fits the data may not adequately describe the selection function when additional data are considered (see, e.g., Holland, 1989). It is also true that any structural equation model that fits the data is only one of many (Dwyer, 1983). Just because the data fit does not mean that the model's effect estimates are valid.

The panel takes a guarded view of the current suitability of modeling approaches to estimate the effects of AIDS prevention programs. The appropriateness of this view may change as experience accumulates in evaluating AIDS programs. In this regard the panel notes that selection models all aim at one desideratum: consistency.49 But a consistent estimate can have so much variability as to be of no practical value for samples of realistic size. Lacking a theoretical understanding on this matter, we can either ignore these methods or get some structured experience using them, which is the panel's proposed resolution.

The Role of Models

The unresolved debate between modelers and their critics and the available evidence lead us to two conclusions. One, the panel finds little evidence to persuade us that it would be prudent to rely extensively on selection models in the AIDS arena in the near future. Two, we believe that it may be wise to obtain further evidence about the performance and

48 Data quality is an issue for every design. It is a particular concern in the case of modeling efforts (versus experiments) because the device used to produce comparability between groups in a model can be affected by the errors and biases of the measurements, e.g., self-reports of sexual behaviors prior to entry into the program. These measurements are subject to both random errors and to bias in reporting (see Miller, Turner, and Moses, 1990: Ch. 6; included as Appendix C to this report). In a well-executed randomized experiment, the method of producing comparable groups is not subject to such uncertainties.

Overall, the field of selection modeling is relatively new, having been developed in the 1970s and applied to program evaluation in the 1980s (see, e.g., Heckman, 1979; Barnow, Cain, and Goldberger, 1980), and its full potential has not yet been realized. The panel would like to see its value empirically established, and we believe the federal government should fund the appropriate research.

The panel recommends that the National Science Foundation sponsor research into the empirical accuracy of estimates of program effects derived from selection model methods.

One avenue would be to provide researchers financial support and access to randomized controlled trial data to see if they can reproduce the findings with nonrandomized comparisons. Progress along these lines, if achieved, would be valuable indeed. It would help answer the scientific question about how well nonexperimental approaches work in estimating effects. In addition, this recommendation may help to solve problems not only in the AIDS intervention arena but also in other areas in which it is difficult to implement randomized controlled experiments.50

WHEN SHOULD NONRANDOMIZED APPROACHES BE CONSIDERED?

Although the panel believes that randomized controlled experiments ought to form the backbone of CDC's evaluation strategy, we understand that they cannot constitute the exclusive strategy. Under some conditions, randomization is precluded, and an evaluation cannot be conducted unless other methodologies are considered. In this section the panel looks at a set of five conditions under which investigators should seriously consider the use of nonrandomized approaches.51 Some of the conditions are implicit in the earlier text; for example, where an empirical base needs to be established, alternative observational methods may be quite useful.

50 Another role that models play is to provide a framework for "sensitivity analyses," that is, assessments of the sensitivity of estimates of program effects to departures from the assumptions used in the analysis. When data are not available on selection variables and unverifiable assumptions about the selection process have to be made, researchers can adjust the assumptions in a variety of ways to examine the range of possible estimates of effects (within the contexts of specific models). These results are suggestive of the range of plausible project effects (assuming that a particular model is well specified). This would be useful information to have, and it could lead to a more stringent evaluation of the intervention at a later time.

51 Some scientists argue that there is an additional situation in which nonrandomized approaches should be used, namely, when the nonrandomized study is more relevant to the full-scale implementation because the intervention, when ultimately implemented on a national scale, will be accompanied by attrition, self-selection, spillovers, etc. This is a complex issue, but the panel does not, in general, believe that the "messiness" of the real-world implementation of a program necessarily argues for a nonexperimental design. The panel believes that the choice of method should be determined by what is feasible and what will provide an appropriate level of certainty about the answers that are obtained. In that regard, it should be borne in mind that only some of the "messiness" of a real-world implementation is relevant to assessing the effect of the intervention as implemented. Other aspects have little to do with the intervention but are of concern to researchers. For example, in some instances sample attrition will be part of the phenomenon of interest; that is, people will be lost from a treatment (intervention) program itself, and that is part of the outcome (recall, for example, the case of Antabuse noted on page 136). In such cases, as discussed previously, the appropriate analysis should assess the overall outcome (including loss of persons through attrition from the intervention). In other instances, the delivery of an intervention (treatment) may be less than optimal when a program is broadly implemented. In such cases, one might envision an evaluation whose "treatment" was the "intention to deliver the intervention program" and not the optimal delivery of the intervention in a carefully controlled setting. A design for such a study of the joint impact of the treatment and its real-world implementation could be a randomized experiment, or not. It should be noted, however, that some of the real-world problems one encounters are not germane to the effectiveness of the program although they can make the research task difficult. Sample attrition, for example, may reflect only a poorly implemented data collection plan (and not a treatment outcome). That is, the intervention may have been well delivered to almost all participants, but long-term, post-treatment data gathering may have suffered from a high rate of sample attrition. In this case, there is only a failure of the research effort; it has no necessary consequence for the effectiveness of the intervention. A better-executed data collection program might remove this defect; it is not a characteristic of the intervention itself.
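A minimal illustration of the sensitivity analyses described in note 50 (all names and numbers here are invented, and the single bias parameter is a deliberate simplification of the model-based adjustments the note has in mind): vary an assumed selection bias in the comparison group and trace out the implied range of program effects.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical outcome: proportion of protected acts, program participants
# versus a self-selected comparison group.
participants = rng.normal(loc=0.60, scale=0.15, size=200)
comparison = rng.normal(loc=0.45, scale=0.15, size=200)
naive_effect = participants.mean() - comparison.mean()

# If self-selection alone would have made the groups differ by delta on the
# outcome scale, the adjusted effect is the naive estimate minus delta.
for delta in np.linspace(-0.10, 0.10, 5):
    print(f"assumed selection bias {delta:+.2f} -> effect {naive_effect - delta:+.3f}")
```

If the estimated effect stays positive over the whole plausible range of bias, the qualitative conclusion is robust; if its sign flips within that range, the study cannot settle the question.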

In addition, positive answers to the following questions invite serious consideration of quasi-experimentation and nonexperimental methods.

1. Can the decision maker tolerate serious ambiguity in estimating the effect of a program or project?

When effects are uncertain and randomization is not feasible, nonrandomized approaches may lead to ambiguous, but tolerable, conclusions. Quasi-experimental evaluations, for example, attempt to estimate an intervention's effect while taking into account plausible competing explanations for its apparent effect, but those competing explanations may involve "causes" that are not finely identifiable or estimable. For example, Kelly and colleagues (1990) were satisfied simply to detect an effect and to tolerate an imprecise measure of effect estimates in their test of a "diffusion-of-innovation" intervention at the community level.52 By limiting the study to a few sites, it was less costly to deliver the intervention. (As discussed under the section on randomized experiments, a sufficient number of units is needed to detect differences in outcome variables, and the cost of delivering an intervention to the necessary number of units (e.g., communities) may be prohibitively expensive when effects are uncertain.)

52 Rather than randomize communities, investigators chose three "relatively isolated" communities and measured baseline sexual behaviors at two points prior to the intervention. No intervention was directed at two of the communities, while the third received an intervention that taught risk reduction strategies to men identified as opinion leaders. At two points later, evaluators applied the same behavioral measures and found statistically significant decreases in risky behaviors for the intervention population relative to the comparison community populations.

On the other hand, the lack of randomization into participation categories means that it is not possible to discount the possibility that something other than the planned intervention occurred in the intervention community to cause the change. Thus, this example accepts some ambiguity, but it also may be more feasible than a randomized experiment. In addition, the example reveals the value of quasi-experiments in giving investigators experience with intervention procedures before they are deployed in an experimental study.

2. Are the competing explanations for the project's effect reasonably assumed to be negligible?

The number of instances in which competing explanations are negligible will be few. They do exist, however. In testing an algebra course for third graders, for example, it will often be safe to assume that third graders who are not involved in the curriculum will not learn algebra on their own. A before-and-after design would then be sufficient to estimate the effect of the curriculum on children's knowledge.53 Similarly, a before-and-after design might be acceptable to test the effects on schoolchildren of an intensive CBO project to reduce stigmatization of a prospective seropositive classmate. The media campaign's effects on this population might be assumed to be negligible (given the late hour of most broadcasts of national public service announcements about AIDS and the reading level required for published materials). Any changes in attitudes might then be attributed to the intervention.

3. Must the program be deployed to all relevant individuals or institutions that are eligible?

As mentioned earlier, a community-wide intervention project to prevent AIDS may be swiftly implemented and offered to all eligible residents, thus saturating the community and precluding the random assignment of individual residents to experimental and control conditions. Consequently, any evaluation design will have to depend on quasi-experimental or statistical adjustment methods. For example, a time series analysis of trends in condom sales, visits to STD clinics, and sales of safe sex videos or books might be implemented (a sketch of such an analysis follows below). Note, however, that when multiple sites are involved, the panel suggests that communities themselves might be randomly assigned to an intervention or to a control condition in the interest of estimating the effects of the program.

53 Before-and-after evaluation designs are discussed in Chapter 4.
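As a sketch of the time series idea under question 3 (the data, the month-24 start date, and the simple level-shift model are all invented for illustration; a real analysis would also model serial correlation, e.g., along the lines of Box and Tiao, 1965):

```python
import numpy as np

rng = np.random.default_rng(2)
months = np.arange(48)
post = (months >= 24).astype(float)   # community-wide project starts at month 24
# Synthetic monthly condom sales: baseline level, mild trend, level shift, noise.
sales = 1000 + 2.0 * months + 150 * post + rng.normal(0, 40, size=48)

# Segmented regression: intercept, pre-existing trend, post-intervention shift.
X = np.column_stack([np.ones(48), months, post])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(f"estimated post-intervention level shift: {beta[2]:.0f} units")
```

The level-shift coefficient is only as credible as the assumption that nothing else changed at month 24, which is precisely the ambiguity discussed under question 1.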

4. Will a nonrandomized approach meet standards of ethical propriety while a randomized experiment will not?

As discussed in Chapters 4 and 5, random assignment to an intervention or to a control group fails to meet standards of ethical propriety if resources are in ample supply to provide the intervention, it is not otherwise available, and the beneficial effects of the intervention are assumed to outweigh any negative effects. HIV testing, for example, is believed to be an effective medical care procedure, thus making a randomized no-treatment control inappropriate for estimating the effect of CDC's counseling and testing program.54 In this case, it might be possible to use a time series design to examine the effectiveness of a new counseling and testing setting on the accessibility of services. For example, suppose a small community with HIV test facilities in its public health and family planning clinics wishes to open a new site specifically to attract gay men. Before opening the new site, the community can count the number of test takers using test facilities by their risk exposure group (as identified in Figure 5-1 in Chapter 5). After the new site is open, the number of test takers by risk group can be recounted (actually, a series of before-and-after measurements would be preferred). If the number of gay test takers increases (without a corresponding decrease in the other categories to which they may have assigned themselves), it might be inferred that the new project was effective in attracting gay men. (A sketch of such a comparison appears after question 5 below.)

5. Are theory- or data-based predictions of effectiveness so strong that nonexperimental evidence will suffice?55

In some cases, theory may predict dramatic effect sizes. It is often (but not always) true that the larger the expected impact of an intervention, the less accurate an evaluation technique one needs to discern that impact. Extremely persuasive educational and prevention projects might, for example, produce such large effects that the impact would be convincingly evident even with observational designs that are more vulnerable to bias. In other cases, an intervention may have previously been shown to make a difference under a given set of circumstances or within a given subgroup using a randomized experiment. In these cases, suppose the generalizability of this finding is not known, and an investigator wishes to test the intervention in a different setting or among a different target group. Under these circumstances, the inferences from an observational study may be sufficiently convincing as to preclude the need for a full-scale experiment. Consider, for example, the case of a counseling support project that has been tested in a randomized controlled experiment and shown to increase gay men's behavioral skills for refusing sexual coercion (Kelly et al., 1989). The support project's effectiveness among women partners of intravenous drug users, however, is unknown. To test it, a quasi-experiment might be designed.

54 See Appendix D for further discussion of the ethical concerns of evaluating patient care procedures.

55 Note that it is important to differentiate well-founded predictions of effectiveness from "common knowledge" of what works. Too often hunches or instincts about what works have stood in the way of deciding to conduct a well-controlled randomized study.
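The counting comparison described under question 4 can be put through an elementary check. A sketch with invented counts (the risk-group categories loosely follow Figure 5-1; the test only flags a shift in the mix of test takers, not its cause):

```python
from scipy.stats import chi2_contingency

# Test takers by self-assigned risk exposure group, before and after the
# new site opens (all counts hypothetical).
counts = [
    [120, 80, 60, 40],   # before: gay/bisexual, IV drug use, heterosexual, other
    [210, 85, 65, 45],   # after
]
chi2, p, dof, _ = chi2_contingency(counts)
print(f"chi-square = {chi2:.1f}, dof = {dof}, p = {p:.4f}")
# An increase in the gay/bisexual column without offsetting declines elsewhere
# is consistent with the new site attracting its intended users.
```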

In a final section, below, the panel considers the investigator's final assessment of the results of his or her study, whether it be a randomized experiment or not.

INTERPRETING EVALUATION RESULTS

The goal of outcome evaluation is to determine how well a project works. Part of this determination, no matter the method chosen for evaluation, involves an investigator's interpretation of results. The degree of certainty that the observed outcomes result from the intervention is a function of, among other things: the reasonableness of the assumptions behind the evaluation strategy, the quality and amount of the data, and the plausibility of counter-hypotheses that could account for the observed data. It is also important for interpretation to address whether results are specific to a given set of circumstances or are generalizable to other populations.

Randomized Experiments

Assuming that randomized controlled trials are used, the assumptions underlying the inference of effects are generally easy to verify, which will facilitate acceptance of a study's interpretation. One still needs, however, to examine the data on project participants and the project itself to ensure the internal validity of the experiment. Such validation is needed to be sure that the project and randomization were implemented as designed and that the degree of attrition is acceptably small. If these conditions are satisfied, differences between units can be analyzed using standard statistical tests (see the sketch below).

In the end, even if the results are strongly encouraging for a subgroup of a population, generalizability will often be uncertain. The results from a single experiment may allow strong and rather precise inferences of causality, but because they are likely to be based on small, selective samples, they may be equivocal in terms of how the project will work in other groups, other settings, and other regions of the country. Whatever is known about the experiment should be communicated in the interpretation of results.
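"Standard statistical tests" here can be as plain as a two-sample comparison of the randomized arms. A sketch with synthetic outcome scores (the variable and effect size are invented):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
# Post-intervention outcome (e.g., a risk-reduction skill score) for the two
# randomized arms of a hypothetical trial.
treatment = rng.normal(loc=0.55, scale=0.20, size=150)
control = rng.normal(loc=0.48, scale=0.20, size=150)

t_stat, p_value = ttest_ind(treatment, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# Randomization, if intact, licenses reading the mean difference as the
# program effect; attrition or broken assignment would void that reading.
```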

Nonrandomized Methods

Nonrandomized methods make greater use of assumptions than randomized trials. In interpreting the results of such studies, the plausibility of these assumptions must be considered (and reported) because they will vary from one design to another, and they are crucial to the inferences that will be drawn. Moreover, investigators need to analyze the sensitivity of their inferences to the likely amount of departure from these assumptions.

Accessibility of Assumptions

All of the alternatives to randomization have one thing in common: they rely on assumptions that are not directly verifiable. The nonexperimental alternatives differ, however, in the nature of the assumptions that are necessary.

Observational studies, natural experiments, and matching approaches tend to make assumptions that, although they may not be directly verifiable, can be expressed in accessible everyday terms. Comparison groups must be similar to treatment groups in every respect (other than the treatment) that might influence the outcome; there must be no changes other than the treatment between pretest and posttest; and so on. To the extent that we know the factors that influence the outcome variable, we may be able to assess whether there are differences between comparison and treatment conditions.

Analysis of covariance, selection models, structural equation models, and other statistical techniques require assumptions that are generally expressed in formal statistical terms that are somewhat removed from everyday experience. Analysis of covariance, for instance, assumes that the relationship between outcome variables, covariates, and the treatment can be adequately and fully expressed in a particular form of (single-equation) statistical model (one common writing of that equation is sketched below). Selection models based on historical controls assume that the treatment and comparison groups are similar with respect to the first, second, third, or higher order differences over time in the outcome variable. Structural equation models make complex assumptions about the covariance structure among all of the variables in the model. The appropriateness of such assumptions can be quite difficult to assess, even if one is familiar with the statistical language and the subject matter, and external validation data are often unavailable. Although there are some statistical techniques for testing the inadequacy of the requisite assumptions for all of these models, there is no general way to determine that the assumptions hold.

In summary, compared to quasi-experimental designs, the complex statistical alternatives to randomization require more elaborate assumptions that can be quite difficult to verify.
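For concreteness, the single-equation form that analysis of covariance presumes can be written as follows (our notation; the report does not display it):

```latex
% ANCOVA's working model: outcome Y_i, treatment indicator T_i, covariate X_i.
Y_i \;=\; \alpha \;+\; \tau\, T_i \;+\; \beta\, X_i \;+\; \varepsilon_i,
\qquad \varepsilon_i \overset{\text{i.i.d.}}{\sim} (0, \sigma^2).
% The method's validity rests on this equation being adequate and complete:
% a common slope \beta in both groups, a linear covariate relationship, and
% no omitted variable correlated with T_i. None of these is directly testable
% from the treated and comparison data alone.
```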

Interpretation

Besides plausible assumptions, the interpretation of observational studies is also a function of data quality and competing hypotheses for change in the observed outcomes. The panel has addressed both of these issues in this chapter, but we wish to add a note on how competing hypotheses might be ruled out in a more trustworthy way. A set of six criteria developed by Hill (1971) to assess observational studies in the field of medicine is of interest. These criteria, which have been modified over the years, point out the need to take into account the whole of the evidence, not just selected studies, in interpreting whether an observed association is causal. A recent report of the Committee on Diet and Health (1989) restated Hill's criteria to include the following:

· the strength of association between the intervention and the observed outcome,
· the "dose-response relationship," in which greater effects are demonstrated from more intense treatments,56
· a temporally correct association (i.e., an appropriate time sequence between the intervention and the observed outcome),
· the consistency with which similar associations are found in a variety of evaluations,
· the specificity of the association, and
· plausibility (i.e., the supposed causal association comports with existing knowledge).

Although several of these criteria are applicable to the findings of any one study, the consistency of association and the notion of plausibility argue that a study also be interpreted in the context of other findings.

One of the greatest difficulties for observational studies to surmount is their vulnerability to counter-hypotheses that could account for differences between the comparison and treatment groups (based on factors other than the intervention). Although this problem is inherent to the approach, certainty about a particular causal inference increases as a reservoir of similar findings is accumulated across studies using disparate methods. What is more, even flawed studies can be convincing when a body of evidence is compiled (a sketch of one way of pooling such evidence follows below).

When data are drawn from several studies, however, they are sometimes difficult to compare because the studies use different definitions of target audiences, different specifications of causal variables, different outcome measures, different wordings of survey questions, and so on. These differences make it hard to compare results across studies and detract from their interpretation as a whole.

56 A ceiling effect may sometimes appear, thus diluting the dose-response relationship.
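Where several studies are comparable, one conventional way to compile such a body of evidence is inverse-variance pooling of the per-study effect estimates. A sketch with invented numbers (this is a fixed-effect combination, and it presumes the studies estimate a common effect; the report does not prescribe a pooling method):

```python
import numpy as np

effects = np.array([0.12, 0.08, 0.15])   # per-study effect estimates (hypothetical)
ses = np.array([0.05, 0.04, 0.07])       # their standard errors (hypothetical)

weights = 1.0 / ses**2                   # weight each study by its precision
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))
print(f"pooled effect = {pooled:.3f} (SE {pooled_se:.3f})")
```

Such pooling, of course, presumes exactly the comparability that the surrounding discussion identifies as the hard part.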

Moreover, differences between studies also make results difficult to generalize, regardless of whether experimental or nonexperimental studies are used. We believe that a way exists to improve their interpretability.

The panel recommends that the Public Health Service and other agencies that sponsor the evaluation of AIDS prevention research require the collection of selected subsets of common data elements across evaluation studies to ensure comparability across sites and to establish and improve data validity and reliability.57

Questions about a project's applicability to other populations require information on the populations for which the project succeeded, peculiarities of the region or the population that were important to its success, and the cost of the project and possible areas for cost reduction. The hope is that an evaluation that suggests success for a particular project in one area will lead to a rapid implementation of the project in similar regions and to its gradual implementation in regions less and less similar to the original site evaluated, so that the generalizability of the initial finding is not assumed to stretch too far without empirical verification.

None of this is meant to imply that the panel urges scores of evaluations. The panel believes that more certain and useful knowledge will be gained by a smaller number of well-executed studies than by a precipitous rush to assess the effects of every prevention program that is being mounted. At present, the panel believes the randomized experiment to be the most appropriate design for outcome evaluation, both in terms of clarity and dispatch of results, all else being equal. At the same time, we recognize that the strategy will not always be feasible or appropriate and, for these situations, other designs may have to be deployed until evidence accumulates to make their interpretation dependable or until a randomized experiment can be conducted.

57 Furthermore, methodological research is urgently needed to study the validity and reliability of behavioral measurements. Appendix C is devoted to a discussion of these issues.

REFERENCES

Barnow, B. S. (1973) The effects of Head Start and socioeconomic status on cognitive development of disadvantaged children. Ph.D. dissertation. University of Wisconsin.

Barnow, B. S., Cain, G. G., and Goldberger, A. S. (1980) Issues in the analysis of selectivity bias. In E. W. Stromsdorfer and G. Farkas, eds., Evaluation Studies Review Annual, Vol. 5. Beverly Hills, Calif.: Sage Publications.
Bentler, P. M. (1980) Multivariate analysis with latent variables: Causal modeling. Annual Review of Psychology 31:419-456.
Bentler, P. M. (1990) Structural equation modeling and AIDS prevention research. Presented at the NRC Conference on Nonexperimental Approaches to Evaluating AIDS Prevention Programs, Washington, D.C., January 12-13.
Berk, R. A., and Rauma, D. (1983) Capitalizing on nonrandom assignment to treatments: A regression-discontinuity evaluation of a crime-control program. Journal of the American Statistical Association 78:21-27.
Betsey, C. L., Hollister, R. G., and Papageorgiou, M. R., eds. (1985) Youth Employment and Training Programs: The YEDPA Years. Report of the NRC Committee on Youth Employment Programs. Washington, D.C.: National Academy Press.
Boruch, R. F. (1986) Comparative aspects of randomized experiments for planning and evaluation. In M. Bulmer, ed., Social Science Research and Government. New York: Cambridge University Press.
Boruch, R. F., and Riecken, H. W., eds. (1975) Experimental Tests of Public Policy. Boulder, Colo.: Westview Press.
Box, G. E. P., and Tiao, G. C. (1965) A change in level of non-stationary time series. Biometrika 52:181-192.
Bryk, A. S., and Weisberg, H. I. (1976) Value-added analysis: A dynamic approach to the estimation of treatment effects. Journal of Educational Statistics 1:127-155.
Campbell, D. T. (1990) Quasi-experimental design in AIDS prevention research. Presented at the NRC Conference on Nonexperimental Approaches to Evaluating AIDS Prevention Programs, Washington, D.C., January 12-13.
Campbell, D. T., and Stanley, J. C. (1966) Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally.
Chapin, F. S. (1947) Experimental Designs in Sociological Research. New York: Harper.
Coates, T. J., McKusick, L., Kuno, R., and Stites, D. P. (1989) Stress reduction training changed number of sexual partners but not immune function in men with HIV. American Journal of Public Health 79:885-887.
Cochran, W. G. (1965) The planning of observational studies of human populations. Journal of the Royal Statistical Society, Part 2, 128:234-255.
Cook, T. D., and Campbell, D. T. (1979) Quasi-Experimentation: Design & Analysis Issues for Field Settings. Boston: Houghton Mifflin.
Committee on Diet and Health (1989) Diet and Health: Implications for Reducing Chronic Disease Risk. Report of the NRC Food and Nutrition Board. Washington, D.C.: National Academy Press.
Coronary Drug Project Research Group (1980) Influence of adherence to treatment and response of cholesterol on mortality in the coronary drug project. New England Journal of Medicine 303:1038-1041.
Duncan, O. D. (1975) Introduction to Structural Equation Models. New York: Academic Press.
Dwyer, J. H. (1983) Statistical Models for the Social and Behavioral Sciences. New York: Oxford University Press.
Ehrenberg, A. S. C. (1968) The elements of lawlike relationships. Journal of the Royal Statistical Society, Series A, 131:280-302.

Emmett, B. P. (1966) The design of investigations into the effects of radio and television programmes and other mass communications. Journal of the Royal Statistical Society, Part 1, 129:26-49.
Fehrs, L. J., Fleming, D., Foster, L. R., McAlister, R. O., Fox, V., et al. (1988) Trial of anonymous versus confidential human immunodeficiency virus testing. Lancet 2:379-382.
Fisher, B., Redmond, C., Fisher, E. R., Bauer, M., Wolmark, N., et al. (1985) Ten-year results of a randomized clinical trial comparing radical mastectomy and total mastectomy with or without radiation. New England Journal of Medicine 312:674-681.
Fleiss, J. L., and Tanur, J. M. (1973) The analysis of covariance in psychopathology. In M. Hammer, K. Salzinger, and S. Sutton, eds., Psychopathology: Contributions from the Social, Behavioral, and Biological Sciences. New York: John Wiley & Sons.
Fox, R., Odaka, N. J., Brookmeyer, R., and Polk, B. F. (1987) Effect of HIV antibody disclosure on subsequent sexual activity in homosexual men. AIDS 1:241-246.
Fraker, T., and Maynard, R. (1986) The Adequacy of Comparison Group Design for Evaluations of Employment-Related Programs. Princeton, N.J.: Mathematica Policy Research.
Friedman, S. R., Rosenblum, A., Goldsmith, D., Des Jarlais, D. C., Sufian, M., et al. (1989) Risk factors for HIV-1 infection among street-recruited intravenous drug users in New York City. Presented at the Fifth International Conference on AIDS, Montreal, June 4-9.
Fuller, R. K., Branchey, L., Brightwell, D. R., Derman, R. M., Emrick, C. D., et al. (1986) Disulfiram treatment of alcoholism: A Veterans Administration cooperative study. Journal of the American Medical Association 256:1449-1455.
Goldberger, A. S., and Duncan, O. D., eds. (1973) Structural Equation Models in the Social Sciences. New York: Seminar Press.
Gostin, L., and Ziegler, A. (1987) A review of AIDS-related legislative and regulatory policy in the United States. Law, Medicine & Health Care 15:5-16.
Hartigan, J. (1986) Discussion 3: Alternative methods for evaluating the impact of intervention. In H. Wainer, ed., Drawing Inferences from Self-Selected Samples. New York: Springer-Verlag.
Heckman, J. J. (1979) Sample selection bias as a specification error. Econometrica 47:153-162.
Heckman, J. J., and Robb, R. (1985a) Alternative methods for evaluating the impact of interventions: An overview. Journal of Econometrics 30:239-267.
Heckman, J. J., and Robb, R. (1985b) Alternative methods for evaluating the impact of interventions. In J. Heckman and B. Singer, eds., Longitudinal Analysis of Labor Market Data. Cambridge: Cambridge University Press.
Heckman, J. J., and Robb, R. (1986a) Alternative methods for solving the problem of selection bias in evaluating the impact of treatments on outcomes. In H. Wainer, ed., Drawing Inferences from Self-Selected Samples. New York: Springer-Verlag.
Heckman, J. J., and Robb, R. (1986b) Postscript: A rejoinder to Tukey. In H. Wainer, ed., Drawing Inferences from Self-Selected Samples. New York: Springer-Verlag.
Heckman, J. J., and Hotz, V. J. (1989a) Choosing among alternative nonexperimental methods for estimating the impact of social programs: The case of manpower training. Journal of the American Statistical Association 84:862-874.

Heckman, J. J., and Hotz, V. J. (1989b) Rejoinder. Journal of the American Statistical Association 84:878-880.
Hennigan, K. M., Del Rosario, M. L., Heath, L., Cook, T. D., Wharton, J. D., and Calder, B. J. (1982) Impact of the introduction of television on crime in the United States: Empirical findings and theoretical implications. Journal of Personality and Social Psychology 42:461-477.
Hill, A. B. (1971) Principles of Medical Statistics. 9th ed. New York: Oxford University Press.
Holland, P. W. (1989) Comment: It's very clear. Journal of the American Statistical Association 84:875-877.
Hubbard, R. L., Marsden, M. E., Cavanaugh, E., Rachal, J. V., and Ginzburg, H. M. (1988) Role of drug-abuse treatment in limiting the spread of AIDS. Reviews of Infectious Diseases 10:377-384.
Joseph, J. G., Montgomery, S. B., Emmons, C. A., Kessler, R. C., Ostrow, D. G., et al. (1987) Magnitude and determinants of behavioral risk reduction: Longitudinal analysis of a cohort at risk for AIDS. Psychology and Health 1:73-95.
Kelly, J. A., St. Lawrence, J. S., Hood, H. V., and Brasfield, T. L. (1989) Behavioral intervention to reduce AIDS risk activities. Journal of Consulting and Clinical Psychology 57:60-67.
Kelly, J. A., St. Lawrence, J. S., Stevenson, L. Y., Diaz, Y. E., Hauth, A. C., et al. (1990) Population-wide risk behavior reduction through diffusion of innovation following intervention with natural opinion leaders. Presented at the Sixth International Conference on AIDS, San Francisco, June 23.
LaLonde, R. J. (1986) Evaluating the econometric evaluations of training programs with experimental data. American Economic Review 76:604-620.
Lohr, W. (1972) An historical view of the research on the behavioral and organizational factors related to the utilization of health services. Social and Economic Analysis Division, Bureau for Health Services Research and Evaluation, Rockville, Md. January.
Lord, F. M. (1967) A paradox in the interpretation of group comparisons. Psychological Bulletin 68:304-305.
Maddala, G. S. (1983) Limited-Dependent and Qualitative Variables in Econometrics. Cambridge: Cambridge University Press.
Magidson, J. (1977) Toward a causal model approach for adjusting for preexisting differences in the nonequivalent control group situation. Evaluation Quarterly 1:399-420.
Martin, J. L., and Dean, L. (1989) Risk factors for AIDS-related bereavement in a cohort of homosexual men in New York City. In B. Cooper and T. Helgason, eds., Epidemiology and the Prevention of Mental Disorders. London: Routledge & Kegan Paul.
Maxwell, S. E., and Delaney, H. D. (1990) Designing Experiments and Analyzing Data. Belmont, Calif.: Wadsworth Publishing.
McCusker, J., Stoddard, A. M., Mayer, K. H., Zapka, J., Morrison, C., and Saltzman, S. P. (1988) Effects of HIV antibody test knowledge on subsequent sexual behaviors in a cohort of homosexually active men. American Journal of Public Health 78:462-467.
McGlothlin, W. H., and Anglin, M. D. (1981) Shutting off methadone. Archives of General Psychiatry 38:885-892.

McKay, H., McKay, A., and Sinisterra, L. (1973) Stimulation of Intellectual and Social Competence in Colombian Preschool-Age Children Affected by the Multiple Deprivations of Depressed Urban Environments. Second Progress Report. Cali, Colombia: Human Ecology Research Station, Universidad del Valle. September.
McKusick, L., Horstman, W., and Coates, T. J. (1985) AIDS and sexual behavior reported by gay men in San Francisco. American Journal of Public Health 75:493-496.
Miller, H. G., Turner, C. F., and Moses, L. E. (1990) AIDS: The Second Decade. Report of the NRC Committee on AIDS Research and the Behavioral, Social, and Statistical Sciences. Washington, D.C.: National Academy Press.
Moffitt, R. A. (1989) Comment. Journal of the American Statistical Association 84:877-880.
Moffitt, R. A. (1990) Applying Heckman methods for program evaluation to CDC AIDS prevention programs. Presented at the NRC Conference on Nonexperimental Approaches to Evaluating AIDS Prevention Programs, Washington, D.C., January 12-13.
Mood, A. M. (1950) Introduction to the Theory of Statistics. New York: McGraw-Hill.
Nelson, K. E., Vlahov, D., Margolick, J., and Bernal, M. (1989) Blood and plasma donations among a cohort of IV drug users. Presented at the Fifth International Conference on AIDS, Montreal, June 4-9.
Riecken, H. W., and Boruch, R. F., eds. (1974) Social Experimentation: A Method for Planning and Evaluating Social Intervention. Report of a Committee of the Social Science Research Council. New York: Academic Press.
Silverman, W. A. (1977) The lesson of retrolental fibroplasia. Scientific American 236(6):100-107.
Smith, H. S. (1957) Interpretation of adjusted treatment means and regressions in analysis of covariance. Biometrics 13:282-308.
Transportation Research Board (1984) 55: A Decade of Experience. Special Report 204 of the NRC Committee for the Study of the Benefits and Costs of the 55 MPH National Maximum Speed Limit. Washington, D.C.: National Academy Press.
Tukey, J. W. (1986a) Comments. In H. Wainer, ed., Drawing Inferences from Self-Selected Samples. New York: Springer-Verlag.
Tukey, J. W. (1986b) Discussion 4: Mixture modeling versus selection modeling with nonignorable nonresponse. In H. Wainer, ed., Drawing Inferences from Self-Selected Samples. New York: Springer-Verlag.
Turner, C. F., and Martin, E. (1984) Surveying Subjective Phenomena. Two volumes. New York: Russell Sage.
Turner, C. F., Miller, H. G., and Moses, L. E., eds. (1989) AIDS, Sexual Behavior, and Intravenous Drug Use. Report of the NRC Committee on AIDS Research and the Behavioral, Social, and Statistical Sciences. Washington, D.C.: National Academy Press.
Valdiserri, R. O., Lyter, D. W., Leviton, L. C., Callahan, C. M., Kingsley, L. A., and Rinaldo, C. R. (1989) AIDS prevention in homosexual and bisexual men: Results of a randomized trial evaluating two risk reduction interventions. AIDS 3:21-26.
Wilder, C. S. (1972) Physician Visits, Volume and Interval Since Last Visit, United States, 1969. Vital and Health Statistics, Series 10, No. 75. Rockville, Md.: National Center for Health Statistics.

Winkelstein, W., Samuel, M., Padian, N. S., Wiley, J. A., Lang, W., Anderson, R. E., and Levy, J. A. (1987) The San Francisco Men's Health Study. III. Reduction in human immunodeficiency virus transmission among homosexual/bisexual men, 1982-86. American Journal of Public Health 77:685-689.
Ziffer, A., and Ziffer, J. (1989) The need for psychosocial emphasis in academic courses on AIDS. Presented at the Fifth International Conference on AIDS, Montreal, June 4-9.
