
Modern Methods of Clinical Investigation (1990)

Chapter: 10. Should We Change the Rules for Evaluating Medical Technologies?



10
Should We Change the Rules for Evaluating Medical Technologies?

DAVID M. EDDY

Before we launch a new medical technology, we would like to show that it satisfies four criteria:

· It improves the health outcomes patients care about: pain, death, anxiety, disfigurement, disability.
· Its benefits outweigh its harms.
· Its health effects are worth its costs.
· And, if resources are limited, it deserves priority over other technologies.

To apply any of these criteria we need to estimate the magnitude of the technology's benefits and harms. We want to gather this information as accurately, quickly, and inexpensively as possible, to speed the use of technologies that have these properties and to direct our energy away from technologies that do not.

There are many ways to estimate a technology's benefits and harms. They range from simply asking experts (pure clinical judgment) to conducting multiple randomized controlled trials, with anecdotes, clinical series, data bases, non-randomized controlled trials, and case-control studies in between. The choice of a method has great influence on the cost of the evaluation, the time it requires, the accuracy of the information gained, the complexity of administering the evaluation, and the ease of defending the subsequent decisions.

The problem before us is to determine which set of methods delivers information of sufficiently high quality to draw conclusions with confidence, at the lowest cost in time and money.

CURRENT EVALUATIVE METHODS

Currently, very different methods are used to evaluate different types of medical technologies. There are some amazing inconsistencies. In some settings, we insist on direct evidence that compares the effects of the technology against suitable controls, using multiple randomized controlled trials in a variety of settings. In other settings, we do not require any direct comparison of the technology and a control, or any explicit comparison of the technology's benefits versus its harms or costs.

A good example of the first strategy is the evaluation required by the Food and Drug Administration for approval of drugs. I will never forget my first exposure to a new drug application. It described more than a dozen randomized controlled trials involving about 2,000 patients. It filled a room; consisted of 65,000 pages, which, if stood in a pile, would reach 49.5 feet; cost more than $10 million; required four years to complete; and needed a truck to haul it to Washington. At the other end of the spectrum is the evaluation of medical and surgical procedures. For most, there are no randomized controlled trials at all.

There are even inconsistencies within these categories. For example, we can insist that a pharmaceutical company produce the finest evidence that a drug alters some intermediate outcome (e.g., intraocular pressure), but require no controlled evidence at all that changing the intermediate outcome improves the outcome of real interest to patients (e.g., loss of visual field or blindness). We can require dozens of randomized controlled trials to demonstrate that a drug is effective for a particular indication, and leave it to pure clinical judgment to determine its effectiveness for other indications.

These inconsistencies have tremendous implications for the quality of care, the cost of research, and the time required to get effective innovations into widespread use. Consider just the implications for costs. If we demanded at least two randomized controlled trials for every innovation, research costs would be increased by billions of dollars a year. If we were to accept clinical judgment for every innovation, we could save billions of dollars that we now spend on randomized controlled trials, and speed the introduction of new technologies by years.

Given these inconsistencies and their implications, it is worthwhile to ask what information we are really trying to gather with our system for evaluating medical technologies. That might help us determine the best way to gather it.

WHAT ARE WE TRYING TO LEARN?

We need two things to make decisions about a technology. First, we must estimate the approximate magnitude of its benefits and harms. Second, we must determine the range of uncertainty about the estimates. These two points are so crucial to an understanding of the different methods for evaluating technologies that they are worth discussion.

Suppose the outcome of interest is the probability of dying after a heart attack and the technology is a thrombolytic agent. Suppose an experiment has been conducted with 400 patients randomly allocated to receive either the treatment (200 patients) or a placebo (200 patients). Finally, suppose that during the follow-up period, 20 patients in the placebo group died of heart attacks, while 10 patients in the treated group died. Thus, without treatment, the chance of dying of a heart attack is 20 in 200, or 10 percent; with treatment, the chance of dying of a heart attack is 10 in 200, or 5 percent. The magnitude of the effect of treatment is a 5 percent decrease in the chance of dying of a heart attack (10 percent - 5 percent = 5 percent). This effect is shown as the large arrow in Figure 10.1.

For a variety of reasons (e.g., sample size), there is uncertainty about an estimate of this type. This uncertainty can be displayed in terms of confidence intervals or probability distributions. For this particular example, the 95 percent confidence limits for the estimated effect of the technology range from -0.2 percent to 10.2 percent.

FIGURE 10.1 Results of randomized controlled clinical trial of hypothetical treatment for heart attacks. Best estimate of the effect (large arrow), 95 percent confidence limits (small arrows), and probability distribution of the effect (solid line).

These are indicated by the smaller arrows on the graph. The range of uncertainty can also be displayed as a probability distribution for the effect of the technology; it is shown as the solid line. (The height of the distribution at any point reflects the probability that the true reduction in mortality is near that point. Thus, the most likely value for the reduction in mortality is the value under the highest point of the distribution, 5 percent.)

Both the estimated magnitude of the technology's effect and the range of uncertainty are important. For example, it makes a big difference whether the technology reduces the chance of dying of a heart attack by 5 percent or by 25 percent (see Figure 10.2). It also makes a big difference whether the range of uncertainty is ±5.2 percent or ±8.3 percent (see Figure 10.3). To people who interpret statistical significance rigidly, there is even a big difference between a range of uncertainty of ±5.2 percent and ±4.9 percent (see Figure 10.4).

FIGURE 10.2 Probability distributions of effects of two hypothetical treatments for heart attacks. Treatment A (best estimate of effect is -0.05), treatment B (best estimate of effect is -0.25).
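To make these two quantities concrete, the following sketch (hypothetical Python, using only the numbers of the thrombolytic example above; the normal approximation is an assumption introduced here, not the method behind the chapter's figures) computes the best estimate of the effect, its standard error, and the 95 percent confidence limits, and defines the corresponding probability distribution.

```python
import math

# Hypothetical trial from the text: 200 patients per arm,
# 20 deaths in the placebo group, 10 deaths in the treated group.
n_placebo, deaths_placebo = 200, 20
n_treated, deaths_treated = 200, 10

p_placebo = deaths_placebo / n_placebo   # 0.10
p_treated = deaths_treated / n_treated   # 0.05

# Best estimate of the effect: the absolute reduction in the chance of dying.
risk_difference = p_placebo - p_treated  # 0.05, the large arrow in Figure 10.1

# Standard error of the difference of two independent proportions
# (normal approximation).
se = math.sqrt(p_placebo * (1 - p_placebo) / n_placebo
               + p_treated * (1 - p_treated) / n_treated)

# 95 percent confidence limits: roughly -0.2 to +10.2 percentage points,
# i.e., the estimate plus or minus about 5.2 points, as quoted above.
lower, upper = risk_difference - 1.96 * se, risk_difference + 1.96 * se
print(f"risk difference = {risk_difference:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")

# The probability distribution of Figure 10.1 can be approximated by a normal
# density centered on the estimate with this standard error.
def effect_density(x):
    return math.exp(-0.5 * ((x - risk_difference) / se) ** 2) / (se * math.sqrt(2 * math.pi))
```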

FIGURE 10.3 Probability distributions of effects of a hypothetical treatment for heart attacks as estimated from two randomized controlled clinical trials. Trial A (narrow range of uncertainty), trial B (wider range of uncertainty).

QUALITY OF INFORMATION IN DIFFERENT DESIGNS: FACE VALUE

All methods for evaluating a technology, from the lowly pure clinical judgment to the lofty randomized controlled trial, provide information on the magnitude of effect and the range of uncertainty. Furthermore, for the empirical methods, if it were reasonable to take each method at face value (i.e., if it were possible to assume that there were no biases to internal or external validity), then all the designs, case for case, would be almost equally good at estimating the magnitude and range of uncertainty of an outcome. Stated another way, if all the results could be taken at face value, randomized controlled trials, case for case, would not provide any more precise or certain information than designs that are considered less rigorous, such as non-randomized controlled trials, case-control studies, comparisons of clinical series, or analyses of data bases.

Consider, for example, two studies of breast cancer screening in women over age 50. One was a randomized controlled trial of approximately 100,000 women (58,148 women in the group offered screening and 41,104 women in the control group) (1).

FIGURE 10.4 Probability distributions of effects of a hypothetical treatment for heart attacks as estimated from two randomized controlled clinical trials. Trial A statistically significant (solid line), trial B not statistically significant (dashed line).

After seven years of follow-up, there were 71 breast cancer deaths in the group offered screening and 76 breast cancer deaths in the control group. The other was a case-control study in which 54 cases (women who had died of breast cancer) were matched three to one with 162 controls (women who had not died of breast cancer) (2). Retrospective analysis of screening histories found that 11 of the 54 cases had been screened, compared with 73 of the 162 controls.

The probability distributions for the percent reduction in mortality implied by these two studies, taken at face value, are shown in Figure 10.5. The degree of certainty (as indicated by the variance or width of the distribution) is just as high for the case-control study as for the randomized controlled trial. The main determinant of the variance of the estimate is not the total number of people involved in the study (100,000 women in the randomized controlled trial versus 216 in the case-control study), but the number of outcomes of interest that occurred (in this case, breast cancer deaths). The variances in the two studies are similar largely because there were almost as many outcomes in the case-control study (54) as in either group of the randomized controlled trial (71 and 76, respectively).
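To see why the two studies carry comparable information despite their very different sizes, the sketch below (an illustrative Python calculation, not part of either original analysis) computes each study's odds ratio and the standard error of its logarithm using Woolf's approximation; that standard error is dominated by the reciprocals of the smallest cell counts, which here are the numbers of breast cancer deaths.

```python
import math

def odds_ratio_and_se(a, b, c, d):
    """Odds ratio and Woolf standard error of its log for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c), math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

# Randomized trial (1), at face value: breast cancer deaths vs. survivors in the
# group offered screening (58,148 women) and in the control group (41,104 women).
rct_or, rct_se = odds_ratio_and_se(71, 58148 - 71, 76, 41104 - 76)

# Case-control study (2), at face value: screened vs. not screened among the
# 54 cases (breast cancer deaths) and the 162 matched controls.
ccs_or, ccs_se = odds_ratio_and_se(11, 54 - 11, 73, 162 - 73)

print(f"RCT:          odds ratio = {rct_or:.2f}, SE(log OR) = {rct_se:.2f}")
print(f"Case-control: odds ratio = {ccs_or:.2f}, SE(log OR) = {ccs_se:.2f}")
# Although the trial enrolled roughly 500 times as many women, the two standard
# errors are of the same order of magnitude, because both are dominated by the
# small numbers of breast cancer deaths rather than by the total enrollment.
```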

FIGURE 10.5 Probability distributions of two controlled clinical trials of breast cancer screening taken at face value. Swedish prospective randomized controlled trial (1), DOM retrospective case-control study (2).

Thus, if we take the two studies at face value, it is clear that they provide virtually the same information. The difference between the studies is logistics. The randomized controlled trial involved recruiting and randomizing 100,000 people, screening about half of them, and following everyone for more than a decade. The logistics of the case-control study, on the other hand, were much simpler and less expensive. It required identifying only 54 women who died of breast cancer (the cases), 162 women matched by year of birth who had not died of breast cancer (the controls), and retrospective ascertainment of which women had been screened. The study collapses down to about 200 women and can be done in six months.

Similar stories can be told about the other designs. Provided the number of cases with the outcome of interest is similar, the degree of certainty in the face-value estimates of all the designs will be similar. But the logistics can be vastly different. To push the example to the extreme, if there were a data base that had the pertinent records, the logistics would be as simple as doing the computer runs.

So, if the quality of the information gained by different designs is essentially the same, but the logistics, costs, and time required are very different, the choice of the best design should be quite simple: pick the fastest and least expensive. What is wrong with this picture?

BIASES

The problem, of course, is with the assumption made at the beginning of the story. There we postulated that it was reasonable to take each design at face value, that is, to assume that there were no biases. In fact, there are biases that affect all evaluative methods. Furthermore, the effects of biases determine the rules for evaluating technologies. To solve the problem of choosing the best evaluative methodologies, we need some background on biases.

It is convenient to separate biases into two types. Biases to internal validity affect the accuracy of the results of the study as an estimate of the effect of the technology in the setting in which a study was conducted (e.g., the specific technology, specific patient indications, and so forth). Biases to external validity affect the applicability of the results to other settings (where the techniques, patient indications, and other factors might be different).

Examples of biases to internal validity include patient selection bias, crossover, errors in measurement of outcomes, and errors in ascertainment of exposure to the technology. Patient selection bias exists when patients in the two groups to be compared (e.g., the control and treated groups of a controlled trial) differ in ways that could affect the outcome of interest. When such differences exist, a difference in outcomes could be due at least in part to inherent differences in the patients, not to the technology. Crossover occurs either when patients in the group offered the technology do not receive it (sometimes called "dilution") or when patients in the control group get the technology (sometimes called "contamination"). Errors in measurement of outcomes can affect a study's results if the techniques used to measure outcomes (e.g., claims data, patient interviews, urine tests, blood samples) do not accurately measure the true outcome. Patients can be misclassified as having had the outcome of interest (e.g., death from breast cancer) when in fact they did not, and vice versa. Errors in ascertainment of exposure to the technology can have an effect similar to crossover. A crucial step in a retrospective study is to determine who got the technology of interest and who did not. These measurements frequently rely on old records and fallible memories. Any errors affect the results.

An example of bias to external validity is the existence of differences between the people studied in the experiment and the people about whom you want to draw conclusions (sometimes called a "population bias"). For example, they might be older or sicker. Another example occurs when the technology used in the experiment differs from the technology of interest, because of differences in technique, equipment, provider skill, or changes in the technology since the experiment was performed. This is sometimes called "intensity bias."

Different evaluative methods are vulnerable to different biases.

TABLE 10.1 Susceptibility of various designs to biases. Rows (designs): randomized controlled trials (RCT), non-randomized controlled trials (nonRCT), case-control studies (CCS), comparisons of clinical series, and data bases. Columns (biases): patient selection, crossover, error in measurement of outcomes, and error in ascertainment of exposure (internal validity); population and technology (external validity). Entries range from 0 (minimal vulnerability to a bias) to +++ (high vulnerability).

At the risk of gross oversimplification, Table 10.1 illustrates the vulnerabilities of different designs to biases. A zero implies that the bias is either nonexistent or likely to be negligible; three plus signs indicate that the bias is likely to be present and to have an important effect on the observed outcome. Methodologists can debate my choices, and there are innumerable conditions and subtle issues that will prevent agreement from ever being reached; the point is not to produce a definitive table of biases, but to convey the general message that all the designs are affected by biases, and the patterns are different for different designs.

For example, a major strength of the randomized controlled trial is that it is virtually free of patient selection biases. Indeed, that is the very purpose of randomization. In contrast, non-randomized controlled trials, case-control studies, and data bases are all subject to patient selection biases. On the other hand, randomized controlled trials are more affected by crossover than the other three designs. All studies are potentially affected by errors in measurement of outcomes, with data bases more vulnerable than most because they are limited to whatever data elements were originally chosen by the designers. Case-control studies are especially vulnerable to misspecification of exposure to the technology, because of their retrospective nature. Data bases can be subject to the same problem, depending on the accuracy with which the data elements were coded.

With respect to external validity, randomized controlled trials are sensitive to population biases, because the recruitment process and admission criteria often result in a narrowly defined set of patient indications. Randomized controlled trials are also vulnerable to concerns that the intensity and quality of care might be different in research settings than in actual practice. The distinction between the "efficacy" of a technology (in research settings) and the "effectiveness" of a technology (in routine practice) reflects this concern.

Thus, the results of a trial might not be widely applicable to other patient indications or less controlled settings. Data bases and case-control studies, on the other hand, tend to draw from "real" populations. All designs are susceptible to changes in the technology, but in different ways. Because they are prospective, randomized controlled trials and non-randomized controlled trials are vulnerable to future changes. Because they are retrospective, case-control studies and retrospective analyses of data bases are vulnerable to differences between the present and the past.

Now that we admit that biases are present and potentially important, our problem becomes much more complicated. We can no longer choose the simplest, quickest, and least expensive design. Now the choice must take potential biases into account.

HEURISTICS

It is easy to imagine that this new problem is extremely complicated. In fact, it can be argued that it exceeds the capacity of the unaided human mind. What, then, do we do? After all, this is a real problem that we have been facing for decades.

In response to the complexity of the problem, we have developed a set of mental simplifications, or heuristics, that convert what would be a very complicated set of judgments into a series of rather simple "yes" and "no" questions. The first and most important heuristic deals with the biases. Typically, we simplify our approach to biases by sorting them into two categories. For each design and each bias, we either declare that bias to be acceptable, take the study at face value, and ignore the bias from that point on; or we declare the bias to be unacceptable, and ignore the study from that point on. This said, it is important to understand that different people can have very different ideas about what constitutes an "acceptable" bias. Someone who believes in only the most rigorous randomized controlled trials (what we might call a "strict constructionist") might say the potential biases of data bases (or case-control studies, or non-randomized controlled trials) are too great to accept. On the other hand, a clinical expert might be quite content to take anecdotes and clinical series at face value.

The next heuristic deals with the difficulty of estimating the magnitude of an effect (e.g., the magnitude of the reduction in mortality achieved by breast cancer screening). That can be quite complex, especially if there are multiple studies with different designs and different results. A much simpler approach is to determine if there is any effect at all, without worrying about its actual magnitude. In practice, we calculate the probability that the study would indicate there is an effect when in fact there is not (the statistical significance of the study). If the result is statistically significant, we feel good, even if the actual magnitude of the effect is very small, or if we have not even estimated the actual magnitude.
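To see how heavily this heuristic leans on an arbitrary threshold, the sketch below (hypothetical Python, with numbers chosen to mirror the ±4.9 percent and ±5.2 percent contrast in Figure 10.4) computes two-sided p-values for two results that share the identical best estimate, a 5 percentage-point reduction in mortality; one squeaks under the 5 percent threshold and the other does not, even though they say essentially the same thing about the magnitude of the effect.

```python
import math

def two_sided_p(estimate, half_width_95):
    """Two-sided p-value for a normally distributed estimate whose
    95 percent confidence interval has the given half-width."""
    se = half_width_95 / 1.96
    z = abs(estimate) / se
    return math.erfc(z / math.sqrt(2))

# Two hypothetical results, both estimating a 5-point absolute reduction in
# mortality, with confidence-interval half-widths of 4.9 and 5.2 points.
for label, half_width in [("half-width 4.9%", 0.049), ("half-width 5.2%", 0.052)]:
    p = two_sided_p(0.05, half_width)
    verdict = "statistically significant" if p < 0.05 else "not statistically significant"
    print(f"{label}: p = {p:.3f} -> {verdict} at the 5 percent level")
```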

The third heuristic deals with the difficult balance between the possibility of rejecting a technology that in fact is effective, versus accepting a technology that in fact is not effective. The most visible heuristic is to declare a technology effective when the "p-value" (the chance of accepting an ineffective technology) falls below 5 percent. This heuristic can be applied without ever calculating the chance of the first type of error (rejecting an effective technology). The mesmerizing power of this statistic can be surprising. For years, the p-value of another study that examined breast cancer screening for women younger than age 50 hovered just above the magic threshold of 5 percent. When some authors found a different way to calculate the statistics that pushed the p-value below 0.05 (3), the National Cancer Institute issued a press release that made national news. This behavior is especially touching because almost half the women in the "screened group" did not receive all the scheduled examinations, a bias that overwhelms the meaning of the p-value. But there are other heuristics. Toward the other extreme is the common sentiment among practitioners that unless a technology has been proven not to be effective, it should be considered effective. The point is not that the heuristics are applied uniformly, simply that they are applied widely. The last set of heuristics is the most sweeping. To deal with the complex issues raised by costs and limited resources, we simply ignore costs and limits on resources.

Our main concern is with the first heuristic, in which some biases are declared acceptable and others are not. Consider the implications of different points of view. To insist on seeing randomized controlled trial "proof" of effectiveness before approving a technology, and not to allow case-control studies, non-randomized controlled trials, analyses of data bases, or comparisons of clinical series (call this the "strict constructionist" approach), is essentially saying that a patient selection bias is not acceptable (see Table 10.1). However, whenever a randomized controlled trial is taken at face value (for example, when the results are analyzed by "intent to treat" without adjusting for crossover), the implication is that crossover, errors in measurement of outcomes, and biases to external validity are either acceptable or somebody else's problems. Ironically, leaving it to decision makers to deal with biases to external validity implies an acceptance of clinical judgment as the preferred method to adjust for those biases.

Now consider the implied set of beliefs at the other end of the spectrum. Those willing to make decisions on the basis of anecdotes and clinical series (let us call them "loose constructionists") are saying either that all the biases that affect those sources of evidence are acceptable, or that it is possible and appropriate to adjust for them subjectively. For example, anyone who draws a conclusion about alternative technologies by comparing separate clinical series of the technologies is either accepting patient selection bias and a wide variety of other confounding factors or claiming an ability to adjust for them mentally.

To summarize the main points about biases: Every design is affected by biases. Different designs are affected differently by different biases. And there is no way to escape subjective judgments in dealing with biases.

The last point is especially important for what follows.

Current evaluative methods rely on subjective judgments for such questions as which technologies require empirical evidence (e.g., drugs, devices, clinical procedures), what types of evidence are acceptable, which outcomes must be demonstrated empirically, when intermediate outcomes are acceptable, which intermediate outcomes to use, which patient indications require empirical evidence, how to extend results to other patient indications, which biases are acceptable, an acceptable α-level for determining statistical significance, and so forth. We can imagine that everything is purely objective, but subjective judgments are all around us. The question is not whether we allow the use of subjective judgments, but how we use them. Should they be implicit and informal, with every man for himself, or explicit, formal, organized, and open to review?

OPTIONS

Now let us return to the problem of choosing the best evaluative strategy. There are three main options:

1. accept the status quo, with its inconsistencies and wide variations in the degrees of rigor used by various approaches;
2. determine which of the current approaches is the most desirable, and move the other approaches toward that end of the spectrum. For example, we could make the strict constructionist approach more loose or the loose constructionist approach more strict. Or,
3. develop a new approach that combines the two extremes.

To decide the merits of these three options, it is necessary to return to the objective. It is to speed the acceptance and diffusion of technologies that are worth the costs and deserve priority, and to restrain technologies for which these conditions do not hold. The status quo (option 1) is highly variable in achieving this objective. We suspect that the strict approach is too slow; too expensive; discards some information from designs that, although not "perfect," are at least useful; and inhibits or at least retards the introduction of some effective technologies. On the other hand, we suspect the loose approach is too subjective, too inaccurate, too arbitrary, and too hidden. It provides no information on the magnitudes of the outcomes; the conclusions can depend more on which experts you happen to choose than on the merits of the technology; there is no trail, making it impossible to examine the logic of the judgments; and it appears to accept too many technologies that are in fact not effective. Furthermore, the basis for deciding which technologies need which types of evaluations seems arbitrary. Is there really any reason to believe there is something inherent about drugs versus procedures that makes multiple randomized controlled trials necessary for drugs, but clinical series and clinical judgment best for procedures? It is difficult to argue that the status quo is the appropriate choice.

This has implications for the second option, picking one of the extremes and moving everything in that direction. If we believe the strict approach is too rigid, we do not want to move everything in that direction.

Would we really want to require 49.5 feet of documentation on every technology, for every indication? Similarly, if we believe the loose approach is too loose, we should not trade in the virtues of rigor for it.

The third option is to draw on the strengths of both. This approach, dare we call it the "flexible but firm" approach, might proceed with the following steps.

1. Drop any preconceived conclusions about which experimental designs are acceptable or not, and which types of subjective judgments are acceptable or not.
2. Gather whatever empirical evidence exists, from any design. If a group is in the process of designing a new study to determine the effectiveness of a technology, it is free to explore and submit any designs it chooses.
3. For each study, identify the potential biases.
4. Estimate the magnitudes of each bias (including, when appropriate, the range of uncertainty about the estimates).

At this point we reach the main fork in the road. Traditionally, anyone evaluating evidence would make a judgment at this point about whether the biases are acceptable. If they are deemed unacceptable, the study, and all the information in it, is discarded. If the biases are considered acceptable, the study is admitted and from that time on the biases are assumed to be unimportant. Thus, the traditional choice is whether to take the study at no value, or to take it at face value.

The flexible but firm approach would take a middle course. It would use formal methods to adjust the results of the studies for biases, and use the adjusted results to make decisions about the merits of the technology. For example, if a randomized controlled trial has crossover, the traditional approach would analyze the data by "intent to treat," which is tantamount to ignoring the bias. The flexible but firm approach would not take the trial at face value but would estimate the proportions of people who crossed over, and adjust for the bias accordingly. If it is thought that patients who crossed over might not be representative of their group (e.g., they might be at higher risk of the outcome), that belief would be quantified and incorporated in the adjustment. Other biases could be addressed in similar fashion. Thus, the remaining steps for the flexible but firm approach are

5. Adjust the results of the studies for biases.
6. Use the adjusted results as the best information available for decisions.

The hallmark of the flexible but firm approach is that it uses formal techniques (e.g., statistical models of biases) to incorporate focused subjective judgments (not global clinical impressions) to adjust (not simply accept or reject) the results of studies, to achieve the best combination of evidence and judgment. The philosophy behind it is that "one size doesn't fit all." The validity of a particular design depends on the question being asked, the disease, the technology, the results, and the suspected biases, among other things. It is not an immutable property of the design.
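As a minimal sketch of what step 5 can look like for a single bias, the Python function below rescales an intent-to-treat risk difference for dilution and contamination. This simple scaling is an illustration chosen here (not the confidence profile machinery described elsewhere in this volume), the proportions in the example call are hypothetical, and it assumes that crossovers are representative of their groups; the belief that they are not would have to be quantified separately, as the text notes.

```python
def adjust_for_crossover(itt_effect, dilution, contamination):
    """Rescale an intent-to-treat effect estimate for crossover.

    dilution:      proportion of the group offered the technology that never received it.
    contamination: proportion of the control group that received it anyway.

    Assuming crossovers are representative of their groups, the observed effect is
    diluted by the gap between the proportions actually treated in the two groups,
    so the adjusted effect divides by that gap.
    """
    actually_treated_gap = (1 - dilution) - contamination
    return itt_effect / actually_treated_gap

# Hypothetical example: an intent-to-treat analysis shows a 5-point reduction in
# mortality, 20 percent of the offered group declined the technology, and
# 5 percent of the control group obtained it elsewhere.
print(adjust_for_crossover(itt_effect=0.05, dilution=0.20, contamination=0.05))  # about 0.067
```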

AN EXAMPLE

For an example, let us return to the two studies of breast cancer screening in women older than age 50. Suppose we are interested in the effectiveness of breast cancer screening in women age 50 to 64 with a combination of breast physical examinations and mammography delivered every two years (call this the circumstances of interest).

The randomized controlled trial is affected by several biases. The main bias to internal validity is that about 20 percent of the group offered screening did not receive it (dilution), and about 5 percent of the control group received screening outside the trial (contamination). Potential biases to external validity are (a) the randomized controlled trial involved only mammography (not breast physical examination and mammography), (b) screening was delivered every three years (not two), and (c) the setting was a randomized controlled trial (not "usual care").

Adjusting for these biases requires estimating the magnitude of each bias. Suppose we estimate that dilution and contamination occurred in the percentages reported (20 percent and 5 percent, respectively), that the lack of breast physical examination and the longer interval between screenings caused the observed results to understate the effectiveness of the combination of breast physical examination and mammography by about 40 percent (with a 95 percent confidence range of 30 percent to 50 percent) (4-7), and that the setting of the trial was natural enough not to affect external validity. Adjustment for these assumptions delivers the estimated effect of breast cancer screening in the circumstances of interest shown in Figure 10.6. The randomized controlled trial taken at face value is included in the figure for comparison (dashed line).

The other study was a case-control study. The main biases to which it is subject are patient selection bias and errors of ascertainment of exposure to screening. The external validity of the study is high because it involved a combination of breast physical examination and mammography delivered every two years in women age 50 to 64 under natural conditions.

The investigators have provided information indicating that when screening was offered, those who chose not to get screened appeared to have an inherently worse prognosis after a cancer was detected (8). Suppose we believe the relative risk of breast cancer death in the women who declined screening, compared with those who accepted screening, was 1.4. Suppose also we believe that the methods for ascertaining who got screened in the seven years prior to the analysis (e.g., chart review, records of screening centers, patient recall, and family recall) were subject to the following error rates:

P("not screened" | screened, cancer) = 15 percent
P("not screened" | screened, no cancer) = 5 percent
P("screened" | not screened, cancer) = 5 percent
P("screened" | not screened, no cancer) = 8 percent
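The sketch below shows one simple way beliefs like these might be turned into a numerical adjustment: an illustrative Python back-calculation that corrects the observed two-by-two table for the stated ascertainment errors and recomputes the odds ratio. It is only a partial illustration; the adjustment for patient selection (the relative risk of 1.4) and the uncertainty in all of these estimates would have to be layered on top.

```python
def true_screened(observed_screened, n_total, p_false_neg, p_false_pos):
    """Back-calculate the number truly screened from the recorded counts.

    p_false_neg: P(recorded "not screened" | truly screened)
    p_false_pos: P(recorded "screened" | truly not screened)
    Solves: observed = (1 - p_false_neg) * true + p_false_pos * (n_total - true).
    """
    sensitivity, specificity = 1 - p_false_neg, 1 - p_false_pos
    return (observed_screened - p_false_pos * n_total) / (sensitivity + specificity - 1)

def odds_ratio(screened_cases, total_cases, screened_controls, total_controls):
    return (screened_cases * (total_controls - screened_controls)) / (
        (total_cases - screened_cases) * screened_controls)

# Recorded exposure in the case-control study: 11 of 54 cases and 73 of 162
# controls were classified as screened; error rates as assumed in the text.
cases_true = true_screened(11, 54, p_false_neg=0.15, p_false_pos=0.05)
controls_true = true_screened(73, 162, p_false_neg=0.05, p_false_pos=0.08)

print(f"odds ratio at face value:               {odds_ratio(11, 54, 73, 162):.2f}")
print(f"odds ratio corrected for ascertainment: {odds_ratio(cases_true, 54, controls_true, 162):.2f}")
```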

FIGURE 10.6 Probability distributions for the Swedish randomized controlled clinical trial taken at face value and adjusted for biases. Face value represented by dashed line, value adjusted for biases (see text) by solid line.

Under this set of beliefs, the probability distribution for the effectiveness of breast cancer screening in the circumstances of interest is shown in Figure 10.7, which includes for comparison the study's results taken at face value, and the two distributions derived from the randomized controlled trial.

This example could be made richer by incorporating uncertainty about any of the estimates of biases, by considering other potential biases that might affect the studies, by introducing and adjusting five other controlled studies of breast cancer screening for this age group (9-13), and by synthesizing the results of all the studies into a single probability distribution.

It is important to understand that the estimation of biases should not be taken lightly. Ideally, bias estimates should be based on empirical evidence (e.g., data on potential biases should be collected during the conduct of a study) and impartial panels should review the assumptions. This example is intended to demonstrate a method, not to promote particular numbers. The concept is that it is possible to improve on the current approach, in which data are either accepted at face value, rejected, or adjusted implicitly by pure subjective judgment. It is also important to understand that formal methods of adjusting for biases cannot and should not make every piece of evidence look good.

FIGURE 10.7 Probability distributions for the Swedish randomized controlled clinical trial and the DOM case-control study, each taken at face value and adjusted for biases (see text). Face values represented by dashed lines, values adjusted for biases by solid lines.

For some studies, by the time adjustments have been made, with honest descriptions of the ranges of uncertainty, there will be virtually nothing left. The distributions will be virtually flat, providing no information about the effect of the technology.

Many people will be uncomfortable with this example. Discomfort caused by disagreement with the specific adjustments is open for discussion. A panel might be appointed to determine the most reasonable assumptions, and the implications of a range of assumptions can be explored. If the discomfort is due to the attempt to incorporate any judgments at all in the interpretation of an experiment, remember the alternatives. If no explicit adjustments are made, the options are either to accept the study at face value (which violates our belief that it is biased), to reject it outright (which violates our belief that it contains some usable information), or to make the adjustments silently (which is more prone to error and closed to review). If the discomfort is simply because the approach is different, let it sink in for a while.

I can already hear the complaints. From the rigorous side: "What a disaster! Admit the value of subjective judgments?! Tamper with the integrity of a randomized controlled trial by 'adjusting' it?! We've spent decades trying to make the evaluative process completely rigorous and clean; you're undoing decades of hard-fought rigor." From the other side: "Do you realize how much more work this would involve? You want us to wait for experimental evidence and actually describe and defend our beliefs? Don't you realize that medicine is an art, not a science?"

The facts are that even in the strictest forums, subjective judgments already are an integral part of the interpretation of evidence, that the results of experiments already are adjusted, and that the choice of "accept at face value" versus "reject" is the grossest form of adjustment. The current system is not rigorous and clean; it is inconsistent and arbitrary. Remember that under the current system, about three-fourths of technologies go unevaluated by any formal means. At the same time, evaluative problems are too complex to be left to judgment alone. Subjective judgments should be used only after all the evidence has been exhausted, they should be highly focused, and they should be integrated with empirical evidence by formal methods.

IMPLICATIONS FOR DECISION MAKERS

The flexible but firm approach will modify the amount of work needed to collect and interpret evidence. For the people who produce the evidence, adoption of a new approach could either decrease or increase the research burden, depending on the current standard that must be met. Compared with the rigorous approach, it should be faster and simpler to gather evidence, because a wider variety of designs can be chosen from, and most of the new options are logistically simpler than the randomized controlled trial. But compared with the loose approach, the proposed approach will require considerably more research, more work to estimate the magnitudes of biases, and more work to perform the adjustments.

The flexible but firm approach will generally make life more difficult for the people who interpret the evidence to make decisions, regardless of whether the current approach is rigorous or subjective. The reason is that in both cases the proposed approach eliminates the "take it or leave it" heuristic that so simplifies the interpretation of biases. It also eliminates use of the p-value to determine when the evidence is sufficient and whether the technology is appropriate (see Figure 10.4). The proposed approach replaces these heuristics with a requirement for explicit identification, estimation, and incorporation of biases. This requires more work, more documentation, and more exposure to criticism.

CONCLUSION

There is no doubt that a change in the evaluative techniques currently used by groups such as the Food and Drug Administration or by clinicians would require major changes in the way we think about medical practices and in the types of evidence required to make decisions. There is also little doubt that we can extract more understanding from existing information, can gather sufficient evidence to make at least some decisions faster and less expensively, and can eliminate or at least narrow the glaring inconsistencies that now exist. The flexible but firm approach would allow us to cut back a bit on the strictness with which some technologies are evaluated, and put more energy into increasing the rigor with which other technologies are evaluated.

REFERENCES

1. Tabar L, Fagerberg CJG, Gad A, et al. Reduction in mortality from breast cancer after mass screening with mammography. Lancet 1985;1:829-832.
2. Collette HJA, Rombach JJ, Day NE, de Waard F. Evaluation of screening for breast cancer in a non-randomised study (the DOM Project) by means of a case-control study. Lancet 1984;1:1224-1226.
3. Chu KC, Smart CR, Tarone RE. Analysis of breast cancer mortality and stage distribution by age for the Health Insurance Plan clinical trial. Journal of the National Cancer Institute 1988;80:1125-1132.
4. Eddy DM. Screening for Cancer: Theory, Analysis and Design. Englewood Cliffs, N.J.: Prentice-Hall, 1980.
5. Bailar JC III. Mammography: A contrary view. Annals of Internal Medicine 1976;84:77-84.
6. Shapiro S. Evidence on screening for breast cancer from a randomized trial. Cancer 1977;39:2772-2782.
7. Shwartz M. An analysis of the benefits of serial screening for breast cancer based upon a mathematical model of the disease. Cancer 1978;41:1550-1564.
8. de Waard F, Collette HJA, Rombach JJ, Collette C. Breast cancer screening, with particular reference to the concept of "high risk" groups. Breast Cancer Research and Treatment 1988;11:125-132.
9. Shapiro S, Venet W, Strax P, et al. Current results of the breast cancer screening randomized trial: The Health Insurance Plan (HIP) of Greater New York study. In Screening for Breast Cancer. Toronto: Hans Huber Publishers, 1988: chapter 6.
10. Verbeek ALM, Hendriks JHCL, Holland R, et al. Mammographic screening and breast cancer mortality: Age-specific effects in the Nijmegen Project, 1975-1982. Lancet 1985;1:865-866.
11. Palli D, Del Turco MR, Buiatti E, et al. A case-control study of the efficacy of a non-randomized breast cancer screening program in Florence (Italy). International Journal of Cancer 1986;38:501-504.
12. U.K. Trial of Early Detection of Breast Cancer Group. First results on mortality reduction in the U.K. trial of early detection of breast cancer. Lancet 1988;2:411-416.
13. Andersson I, Aspegren K, Janzon L, et al. Mammographic screening and mortality from breast cancer: The Malmö Mammographic Screening Trial. British Medical Journal 1988;297:943-948.


