Suggested Citation:"2 Improving Interpretive Performance in Mammography." Institute of Medicine and National Research Council. 2005. Improving Breast Imaging Quality Standards. Washington, DC: The National Academies Press. doi: 10.17226/11308.

2
Improving Interpretive Performance in Mammography

Breast cancer is a significant cause of morbidity and mortality in the United States. Until it can be prevented, the best approach to the control of breast cancer includes mammography screening for early detection. Mammography, however, is not a perfect test, due to the complex architecture of the breast tissue being imaged, the variability of the cancers that may be present, and the technical limitations of the equipment and processing. The technical aspects of mammography are now less variable since the interim Mammography Quality Standards Act (MQSA) regulations went into effect in 1994. At this point, the focus is shifting to the quality of mammography interpretation. The available evidence indicates that interpretive performance is quite variable, but the ambiguities of human decision making, the complexities of clinical practice settings, and the rare occurrence of cancer make measurement, evaluation, and improvement of mammography interpretation a much more difficult task.

The components of current MQSA regulations pertinent to interpretive performance include: (1) medical audit; (2) requirements related to training, including initial training and Continuing Medical Education (CME); and (3) interpretive volume, including initial and continuing experience (minimum of 960 mammograms/2 years for continuing experience). The purpose of this chapter is to explore current evidence on factors that affect the interpretive quality of mammography and to recommend ways to improve and ensure the quality of mammography interpretation. The primary questions that the Committee identified as currently relevant to interpretive performance include whether the current audit procedures are likely to ensure or improve the quality of interpretive performance, and whether any audit procedures applied to the current delivery of U.S. health care will allow for accurate and meaningful estimates of performance. In addition, the Committee questioned whether the current CME and volume requirements enhance performance. These issues, and the current state of research on them, are described in the sections that follow. The current state of knowledge about existing measures and standards is described first in order to define the terms needed to assess the medical audit requirement of MQSA.

CURRENT STATE OF KNOWLEDGE REGARDING APPROPRIATE STANDARDS OR MEASURES

Effectively measuring and analyzing interpretive performance in practice presents many challenges. For example, data must be gathered regarding whether a woman has breast cancer diagnosed within a specified timeframe after a mammogram and whether the finding(s) corresponds with the location in which the cancer is found. Other challenges include reaching agreement regarding the definition of positive and negative interpretation(s), standardizing the patient populations so that comparisons are meaningful, and deciding which measures are the most important reflection of an interpreting physician’s skill. In this section, current well-established performance measures are reviewed and their strengths and weaknesses are discussed. These measures should be made separately for screening examinations (done for asymptomatic women) and diagnostic examinations (done for women with breast symptoms or prior abnormal screening mammograms) because of the inherent differences in these two populations and the pretest probability of disease (Dee and Sickles, 2001; American College of Radiology, 2003). However, for simplicity, in the discussion below “examinations” or “mammograms” are used without designating whether they are screening or diagnostic because the mechanics of the measures are similar in either case.

TABLE 2–1 Terms Used to Define Test Positivity/Negativity in BI-RADS 1st and 4th Editions

ACR Category	BI-RADS Assessment, 1st Edition	BI-RADS Assessment, 4th Edition
0	Need additional imaging	Need additional imaging evaluation and/or prior mammograms for comparison
1	Negative	Negative
2	Benign finding	Benign finding(s)
3	Probably benign	Probably benign finding: short-interval follow-up suggested
4	Suspicious abnormality	Suspicious abnormality: biopsy should be considered (4a, 4b, 4c may be included to reflect increasing suspicion)
5	Highly suggestive of malignancy	Highly suggestive of malignancy: appropriate action should be taken
6	NA	Known, biopsy-proven malignancy: appropriate action should be taken
SOURCE: American College of Radiology (2003).

Before describing the measures, it is important to clearly define a positive and negative test. The Breast Imaging Reporting and Data System (BI-RADS) was developed by the American College of Radiology (ACR), in collaboration with several federal government agencies and other professional societies in order to create a standardized and objective method of categorizing mammography results. The BI-RADS 4th Edition identifies the most commonly used and accepted definitions, which are based on a standard set of assessments first promulgated by the ACR in 1992 and modified slightly in 2003. Table 2–1 outlines terms used to define test positivity/negativity as found in the 1st and 4th editions of BI-RADS.

The assessments are intended to be linked to specific recommendations for care: continued routine screening (Categories 1 and 2); immediate additional imaging, such as additional mammographic views and ultrasound, or comparison with previous films (Category 0); short-interval (typically 6 months) follow-up (Category 3); consideration of biopsy (Category 4); and recommendation for biopsy or surgical consultation (Category 5).

Based on these assessments and recommendations, definitions of a positive mammography interpretation have also been suggested by the ACR BI-RADS Committee, as follows:

Screening Mammography:
Positive test = Category 0, 4, 5
Negative test = Category 1, 2

Diagnostic Mammography:
Positive test = Category 4, 5, 6
Negative test = Category 1, 2, 3

MQSA regulations, in contrast, define a positive mammogram as one that has an overall assessment of findings that is either “suspicious” or “highly suggestive of malignancy.”
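The definitions above can be written down as a small lookup, which makes the difference between the BI-RADS and MQSA notions of a "positive" examination explicit. The following is only an illustrative sketch: the function and constant names are hypothetical, and mapping the MQSA terms "suspicious" and "highly suggestive of malignancy" to Categories 4 and 5 follows Table 2–1 rather than any regulatory text.

```python
from typing import Optional

# Category sets copied from the ACR BI-RADS Committee definitions quoted above.
SCREENING = {"positive": {0, 4, 5}, "negative": {1, 2}}
DIAGNOSTIC = {"positive": {4, 5, 6}, "negative": {1, 2, 3}}

# Assumption: MQSA "suspicious" and "highly suggestive of malignancy"
# correspond to Categories 4 and 5 in Table 2-1.
MQSA_POSITIVE = {4, 5}


def birads_is_positive(category: int, exam_type: str) -> Optional[bool]:
    """Classify an assessment under the BI-RADS definitions.

    Returns True (positive), False (negative), or None when the category
    is not listed in either group for that examination type.
    """
    groups = {"screening": SCREENING, "diagnostic": DIAGNOSTIC}[exam_type]
    if category in groups["positive"]:
        return True
    if category in groups["negative"]:
        return False
    return None


def mqsa_is_positive(category: int) -> bool:
    """Classify an assessment under the MQSA definition of a positive mammogram."""
    return category in MQSA_POSITIVE
```

Under this sketch a Category 0 screening examination is positive by the BI-RADS definition but not by the MQSA definition, which is one reason audit results computed under the two definitions are not directly comparable.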

BI-RADS also now allows a single overall final assessment for the combined mammography and ultrasound imaging. Facilities that perform ultrasound online, at the time of diagnostic evaluation for an abnormal mammogram or palpable mass, will not have outcome statistics comparable to facilities where mammograms are reported without including the ultrasound evaluation. For example, a patient with a palpable finding may go to a facility and be found to have a negative mammogram and positive ultrasound, and the assessment will be reported as positive.

While there has been much improvement in mammography reporting since the adoption of BI-RADS, there is still inter- and intraobserver variability in how this reporting system is used (Kerlikowske et al., 1998). Some variability in calculated performance measures can, therefore, be attributed to variance among interpreting physicians on what constitutes an abnormal mammogram. Moreover, though the intent is clear, the linkage between assessment and recommendations is not always maintained in clinical practice. Indeed, Food and Drug Administration (FDA) rules require use of the overall assessments listed in Table 2–1, but the recommendations associated with each category are not mandated or inspected by FDA. Thus, considerable variability in recommendations exists. For example, 38 percent of women with “probably benign” assessments had recommendations for immediate additional imaging in one national evaluation (Taplin et al., 2002). Some analyses include Category 3 assessments associated with recommendations for performance of additional imaging as positive tests (Barlow et al., 2002). In addition, some women with mammograms interpreted as Category 1 or 2 have received recommendations for biopsy/surgical consult due to a physical finding not seen on the mammogram because mammography cannot rule out cancer (Poplack et al., 2000). Therefore, these standard definitions serve as a starting point, but in practice, adaptations may be needed to accommodate the reality of clinical care.

It is also important to define what constitutes “cancer.” In the context of mammography practice, the gold standard source for breast cancer diagnosis is tissue from the breast, obtained through needle sampling or open biopsy. This tissue sample then leads to the identification of invasive carcinoma or noninvasive ductal carcinoma in situ (DCIS). Breast cancers are labeled invasive because the cells are invading surrounding normal tissue. Invasive cancers account for most (80 percent) of breast cancers found at the time of screening in the United States. DCIS is included as a cancer diagnosis primarily because standard treatment for DCIS currently entails complete excision, similar to invasive cancers. Approximately 20 percent of breast cancer diagnoses are DCIS (Ernster et al., 2002). Lobular carcinoma in situ (LCIS) also is occasionally reported in the tissue, but should not be counted as cancer because it is not currently treated.

TABLE 2–2 Possible Results for a Screening Test

	Cancer Outcome +	Cancer Outcome −
Test Result +	TP (true positive)	FP (false positive)
Test Result −	FN (false negative)	TN (true negative)

Interpretive performance can also vary as a function of the time since the prior mammogram (Yankaskas et al., 2005). Screening guidelines differ regarding the appropriate screening interval (annual recommended by the American Cancer Society [ACS] and the American College of Obstetricians and Gynecologists [ACOG], every 1 to 2 years recommended by the U.S. Preventive Services Task Force [USPSTF]) (U.S. Preventive Services Task Force, 2002; Smith and D’Orsi, 2004; Smith et al., 2005). The period of follow-up after a mammogram, during which women are observed for the occurrence of cancer, must therefore be specified so that performance indices can be calculated and compared in a meaningful way.

With the above definitions, it is possible to identify several measures of interpretive performance. The measures available to assess an interpreting physician’s performance all build from a basic 2×2 table of test result and cancer outcome, as shown in Table 2–2. A one-year follow-up interval should be used to calculate the performance indices so that they are comparable. Standard definitions of these measures are well summarized in the ACR BI-RADS 4th Edition, and are highlighted here along with some of the strengths and weaknesses of each measure. Separating screening data from diagnostic data is essential if performance measures are to be meaningful.

Sensitivity

Sensitivity refers to the ability of a test to find a cancer when it is present [TP/(TP+FN)]. The challenge with this measure is determining whether a cancer has been diagnosed, particularly if a woman was given a negative mammogram interpretation. Those women are not necessarily seen back in the same facility for their next examination. Therefore it is not possible to know with certainty whether they have cancer or not. This problem is called verification bias. Because only those women sent to biopsy within a facility have their true cancer status known, verification bias may lead to an overestimation of sensitivity (Zheng et al., 2005). Relatively complete ascertainment of cancer cases can be expected only if a mammography facility is able to link its examinations to those breast cancer cases compiled in a regional tumor registry, and this is practical only for a very small minority (likely fewer than 5 percent) of mammography facilities in the United States.

Because the ultimate purpose of screening is to reduce disease-specific mortality by detecting and treating early-stage cancers, the sensitivity of mammography is important. However, sensitivity is affected by many factors, including whether it is a first (prevalent¹) mammogram or subsequent (incident) mammogram, the distribution of patient ages and tumor sizes in the population of women being screened by the interpreting physician, the length of time since prior mammograms, the density of the breast tissue among women with cancer, and the number of women with cancer found by an interpreting physician (Carney et al., 2003; Yankaskas et al., 2005). Most screening populations have between 2 and 10 cancers per 1,000 women screened, and among women undergoing periodic screening on a regular basis, the cancer incidence rate is 2 to 4 per 1,000 (American College of Radiology, 2003). Under current MQSA regulations, a single interpreting physician must interpret 960 mammograms over 2 years to maintain accreditation. If he or she is reading only screening (and not any diagnostic) mammograms, he or she may, on average, see two to four women with cancer per year. Estimating sensitivity among such a small set of cancers affects the reliability of the measures. Random variation will be large for some measures, making comparisons among interpreting physicians very difficult, even if the interpreting physician has complete knowledge regarding the cancer status of all the women examined. Because most interpreting physicians do not have that complete information (no linkage to regional tumor registry) or the volumes to create stable estimates, measurement of sensitivity will be of very limited use for individual interpreting physicians in practice.
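To illustrate why so few cancers make an individual sensitivity estimate unstable, the sketch below computes a Wilson score confidence interval for a proportion. The choice of the Wilson interval and the example counts (3 cancers detected out of 4) are assumptions made for illustration; they are not taken from the studies cited above.

```python
import math
from typing import Tuple


def wilson_interval(detected: int, cancers: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95 percent Wilson score interval for detected/cancers.

    `detected` plays the role of TP and `cancers` the role of TP + FN.
    """
    if cancers <= 0 or not 0 <= detected <= cancers:
        raise ValueError("need 0 <= detected <= cancers and cancers > 0")
    p = detected / cancers
    z2 = z * z
    denom = 1.0 + z2 / cancers
    center = (p + z2 / (2 * cancers)) / denom
    half = z * math.sqrt(p * (1 - p) / cancers + z2 / (4 * cancers ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical reader: 4 known cancers, 3 of them detected.
low, high = wilson_interval(3, 4)
```

With these hypothetical counts the point estimate of sensitivity is 0.75, but the interval runs from roughly 0.30 to 0.95, so two readers with quite different underlying sensitivity could not be distinguished on the basis of such data.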

Specificity

Specificity is the ability of the test to determine that a disease is absent when a patient is disease-free [TN/(TN+FP)]. Because most screened women (990 to 998 per 1,000) are disease free, this number will be quite high even if a poorly performing interpreting physician gives nearly every woman a negative interpretation. But interpreting physicians must interpret some mammograms as positive in order to find cancers, so false-positive examinations occur. Estimates of the cumulative risk of a false-positive mammogram over a 10-year period of annual mammography vary between 20 and 50 percent (Elmore et al., 1998; Hofvind et al., 2004), and the risk of a negative invasive procedure may be as high as 6 percent (Hofvind et al., 2004). High specificity of a test is therefore important to limit the harms done to healthy women as a result of screening. Although one study of nearly 500 U.S. women without a history of breast cancer found that 63 percent thought 500 or more false-positive mammograms per life saved was reasonable (Schwartz et al., 2000), the cost and anxiety associated with false-positive mammograms can be substantial. Studies have shown that anxiety usually diminishes soon after the episode, but in some women anxiety can endure, and in one study anxiety was greater prior to the next screening mammogram for women who had undergone biopsy on the previous occasion of screening compared with women who had normal test results (Brett and Austoker, 2001). One study has shown that immediate interpretation of mammograms was associated with reduced levels of anxiety (Barton et al., 2004).
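The link between a per-examination false-positive rate and the cumulative risk over repeated screening can be sketched with simple arithmetic. The example below assumes that each examination carries the same false-positive probability and that examinations are independent; both are simplifying assumptions introduced here for illustration and are not the methods used in the cited studies.

```python
def cumulative_fp_risk(per_exam_fp: float, n_exams: int) -> float:
    """Probability of at least one false positive in n_exams examinations,
    assuming a constant, independent per-examination false-positive probability."""
    if not 0.0 <= per_exam_fp <= 1.0 or n_exams < 0:
        raise ValueError("per_exam_fp must be in [0, 1] and n_exams >= 0")
    return 1.0 - (1.0 - per_exam_fp) ** n_exams


def implied_per_exam_fp(cumulative: float, n_exams: int) -> float:
    """Per-examination false-positive probability implied by a cumulative risk,
    under the same independence assumption."""
    if not 0.0 <= cumulative < 1.0 or n_exams <= 0:
        raise ValueError("cumulative must be in [0, 1) and n_exams > 0")
    return 1.0 - (1.0 - cumulative) ** (1.0 / n_exams)
```

Under these assumptions, a cumulative 10-year risk of 20 to 50 percent corresponds to a per-examination false-positive probability of roughly 2 to 7 percent, which shows how a seemingly high specificity can still expose a large share of regularly screened women to at least one false-positive result.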

¹ The prevalent screen refers to the first time a woman undergoes a screening test. Incident screens refer to subsequent screening tests performed at regular intervals. One useful index of screening mammography performance is that the number of cancers per 1,000 women identified by prevalent screens should be at least two times higher than by incident screens.

Like sensitivity, specificity is a difficult measure to obtain for most interpreting physicians because it requires knowing the cancer status of all women examined (linkage to a regional tumor registry). Because it is difficult to ascertain the status of all women who undergo mammography with respect to the presence or absence of cancer, it is important to be clear about who is being included in the measure and what the follow-up period is. This has led to three levels of false-positive measurement (Bassett et al., 1994):

FP₁: No known cancer within one year of a Category 0, 4, or 5 assessment (screening).
FP₂: No known cancer within one year of a Category 4 or 5 assessment (usually diagnostic).
FP₃: No known cancer within one year of a Category 4 or 5 assessment, for which biopsy was actually performed.

If each of these measures is estimated for a year, they can also be called rates. The limitation in choosing only one of the three rates is that there is a trade-off between the accuracy of the measure and the insight it provides regarding an interpreting physician’s performance. Although FP₃ involves the most accurate measure of cancer status, it reflects only indirectly on the interpreting physician’s choice to send women to biopsy. Interpreting physicians’ ability to make that choice, and to make the recall versus no-recall decision at screening, are important characteristics. The most accurate estimate of FP (FP₃) is therefore not necessarily the measure that provides the best insight into the interpreting physician’s performance. Conversely, FP₁ includes BI-RADS 0’s, a high percentage of which have a low index of suspicion. Furthermore, measuring FP₁ involves knowing the cancer status of all women for whom additional imaging was recommended (defined in BI-RADS as Category 0—incomplete, needs additional imaging). This is challenging because results of the subsequent evaluation may not be available. Currently, MQSA does not require that Category 0 examinations be tracked to determine the final overall assessment. The Committee recommends that for women who need additional imaging, mammography facilities must attempt to track these cases until they resolve to a final assessment. Although studies indicate that some interpreting physicians inappropriately assign women who need additional imaging a Category 3 BI-RADS assessment (Poplack et al., 2000; Taplin et al., 2002), this practice should be discouraged, and all women needing additional imaging should be tracked.

Positive Predictive Value (PPV)

There are three positive predictive values (PPV) that can be measured in practice, derived from the three false-positive measures described above. Again, these different measures are used to accommodate the challenges of data collection in practice. For example, though an interpreting physician may recommend a biopsy, it may not be done, and therefore the true cancer status may not be known. Thus, one must clearly state which PPV or PPVs are being monitored (Bassett et al., 1994), as recommended by the ACR.

PPV₁: The proportion of all women with positive examinations (Category 0, 4, or 5) who are diagnosed with breast cancer [TP/(TP+FP₁)].
PPV₂: The proportion of all women recommended for biopsy after mammography (Category 4 or 5) who are diagnosed with breast cancer [TP/(TP+FP₂)].
PPV₃: The proportion of all women biopsied due to the interpreting physician’s recommendation who are diagnosed with cancer at the time of biopsy [TP/(TP+FP₃)].
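The three definitions differ only in which examinations enter the denominator, which the following sketch makes concrete. The record layout (`Exam`, with a final assessment category, a biopsy flag, and a cancer flag for the one-year follow-up period) is hypothetical and is not a format prescribed by BI-RADS or MQSA.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Exam:
    category: int           # BI-RADS assessment category
    biopsy_performed: bool  # biopsy actually done after the recommendation
    cancer: bool            # known cancer within the one-year follow-up


def _proportion_with_cancer(exams: List[Exam]) -> Optional[float]:
    """Fraction of the selected examinations with cancer; None if none selected."""
    if not exams:
        return None
    return sum(e.cancer for e in exams) / len(exams)


def ppv_levels(exams: Iterable[Exam]) -> Dict[str, Optional[float]]:
    """Compute PPV1, PPV2, and PPV3 by narrowing the denominator step by step."""
    exams = list(exams)
    positive = [e for e in exams if e.category in (0, 4, 5)]       # PPV1 denominator
    biopsy_recommended = [e for e in exams if e.category in (4, 5)]  # PPV2 denominator
    biopsied = [e for e in biopsy_recommended if e.biopsy_performed]  # PPV3 denominator
    return {
        "PPV1": _proportion_with_cancer(positive),
        "PPV2": _proportion_with_cancer(biopsy_recommended),
        "PPV3": _proportion_with_cancer(biopsied),
    }
```

In a hypothetical set of 10 positive examinations (5 Category 0 without cancer, 5 Category 4 of which 3 were biopsied and 2 proved to be cancer), this yields PPV₁ = 0.2, PPV₂ = 0.4, and PPV₃ of about 0.67, illustrating how the value rises as the denominator narrows.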

MQSA requires that interpreting physicians have an established mechanism to ascertain the status of women referred for biopsy. With these data interpreting physicians can measure their PPV₂, but it is still subject to verification bias because not all women recommended for biopsy will have it done and because ascertainment of procedures is never 100 percent. The limitation of PPV₂ or PPV₃ is that many more women are referred for additional imaging (8 percent) than biopsy (1.5 percent) (Taplin et al., 2002). An important skill in interpretation involves sorting who needs additional imaging versus biopsy; PPV₂ and PPV₃ do not account for this because they only focus on women referred for biopsy. The ACR recommends that interpreting physicians who choose to perform one of the two types of audits described in the BI-RADS atlas should track all women referred for additional imaging for their subsequent cancer status (PPV₁) (American College of Radiology, 2003). Because measuring PPV₁ may not be possible in the absence of an integrated health system and registry, the Committee recommends use of PPV₂.

Another limitation of PPV that influences its usefulness is that it is affected by the rate of cancer within the population examined. The PPV will be higher in populations with higher cancer rates. For example, an interpreting physician practicing among older populations of women versus younger will have a higher PPV, just because the risk of breast cancer is higher among older women. PPV₁ will vary depending on the proportion of patients who are having an incident versus prevalent screen. Unfortunately, a high PPV does not necessarily correlate with better performance. For example, the interpreting physician who recommends biopsy for only larger, more classic lesions will have a higher PPV, but will miss the smaller, more subtle, and less characteristic lesions that may be more important to patient outcomes (Sickles, 1992). Therefore the Committee recommends measuring the cancer detection rate in addition to PPV₂ in order to facilitate interpretation of the measure. A higher PPV₂ should occur in a population with a higher cancer detection rate (see section below on Cancer Detection Rate).

Negative Predictive Value (NPV)

Negative predictive value (NPV) is the proportion of all women with a negative result who are actually free of the disease [TN/(FN+TN)]. Monitoring NPV is not a requirement of MQSA, and in practice, the NPV is rarely used because it involves tracking women with negative examinations (linkage to regional tumor registry is required).
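The four ratios defined in this and the preceding sections all follow from the cells of the 2×2 table of test result and cancer outcome. The sketch below collects them in one place; the class name and the example counts are hypothetical and chosen only to show the arithmetic.

```python
from dataclasses import dataclass
from typing import Optional


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator/denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


@dataclass(frozen=True)
class TwoByTwo:
    tp: int  # positive test, cancer present
    fp: int  # positive test, no cancer
    fn: int  # negative test, cancer present
    tn: int  # negative test, no cancer

    def sensitivity(self) -> Optional[float]:
        return _ratio(self.tp, self.tp + self.fn)   # TP/(TP+FN)

    def specificity(self) -> Optional[float]:
        return _ratio(self.tn, self.tn + self.fp)   # TN/(TN+FP)

    def ppv(self) -> Optional[float]:
        return _ratio(self.tp, self.tp + self.fp)   # TP/(TP+FP)

    def npv(self) -> Optional[float]:
        return _ratio(self.tn, self.fn + self.tn)   # TN/(FN+TN)


# Hypothetical audit: 1,000 examinations, 5 cancers, 99 positive interpretations.
audit = TwoByTwo(tp=4, fp=95, fn=1, tn=900)
```

With these hypothetical counts, sensitivity is 0.80 and specificity about 0.90, while PPV is only about 0.04 and NPV exceeds 0.99, reflecting how strongly the predictive values depend on the low prevalence of cancer among screened women.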

Cancer Detection Rate

Cancer detection rate is the number of women found to have breast cancer per 1,000 women examined. This rate is meaningless unless screening mammograms are assessed separately from diagnostic evaluations. This measure is similar to sensitivity, but includes all examinations (not just cancer cases) in the denominator. The advantage is that interpreting physicians know the total number of examinations they have interpreted and can identify the cancers resulting from biopsies they recommended or performed.

The disadvantage is that differences in the cancer detection rate may reflect not only differences in performance, but also differences in the rate and risk of cancer in the population served. A high cancer detection rate relative to other interpreting physicians may simply indicate that the interpreting physician is caring for an older population of women who are at higher risk for cancer, not that he or she is necessarily highly skilled at finding cancer. This difference can be mitigated by adjusting the cancer rate to a standard population age distribution if adequate numbers exist in each age group to allow rate estimates. For radiologists comparing their own measures over time, these kinds of adjustments are less important if the population characteristics are stable.
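One common way to carry out the adjustment described above is direct standardization, in which age-specific detection rates are weighted by a fixed standard age distribution. The sketch below assumes that approach; the age groups, weights, and counts are invented for illustration and do not come from any cited source.

```python
import math
from typing import Dict, Tuple


def crude_rate_per_1000(strata: Dict[str, Tuple[int, int]]) -> float:
    """Cancers per 1,000 examinations, ignoring age. strata maps an age
    group to (cancers detected, examinations interpreted)."""
    cancers = sum(c for c, _ in strata.values())
    exams = sum(n for _, n in strata.values())
    return 1000.0 * cancers / exams


def standardized_rate_per_1000(strata: Dict[str, Tuple[int, int]],
                               weights: Dict[str, float]) -> float:
    """Directly age-standardized detection rate per 1,000 examinations.
    weights gives the standard population's share of each age group."""
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("standard weights must sum to 1")
    rate = 0.0
    for group, weight in weights.items():
        cancers, exams = strata[group]
        if exams == 0:
            raise ValueError(f"no examinations in age group {group}")
        rate += weight * cancers / exams
    return 1000.0 * rate


# Two hypothetical readers with identical age-specific rates (2, 4, and 6
# per 1,000) but different age mixes among the women they examine.
standard = {"40-49": 0.3, "50-64": 0.4, "65+": 0.3}
reader_a = {"40-49": (2, 1000), "50-64": (4, 1000), "65+": (12, 2000)}
reader_b = {"40-49": (6, 3000), "50-64": (4, 1000), "65+": (6, 1000)}
```

In this example the crude rates differ (4.5 versus 3.2 per 1,000) only because reader A sees more older women; after standardization both readers have a rate of 4.0 per 1,000. As noted above, the method is usable only when each age group contains enough examinations to give a stable rate.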

Other factors that could influence the cancer detection rate include the proportion of women having their first (prevalent) screen and the proportion having a repeat (incident) screen, the interval since the prior screen, differing practices with respect to who is included in screenings, whether practices read examinations individually as they are completed or in batches at a later time (mode of interpretation), and how long a physician has been in practice (van Landeghem et al., 2002; Harvey et al., 2003; Smith-Bindman et al., 2003). Interpretive sensitivity and specificity are higher on first screens compared to incident screens, presumably due to slightly larger tumors being found at prevalent screens (Yankaskas et al., 2005). For incident screens, the longer the time since the prior mammogram, the better interpretative performance appears, again because tumors will be slightly larger (Yankaskas et al., 2005). Some practices offer only diagnostic mammography to high-risk women with a history of breast cancer, while others will offer screening. Excluding such women from the screening population will reduce the number of cancers at the time of screening and affect positive predictive values, but may also change a physician’s threshold for calling a positive test. Changes in the threshold for a positive test can affect performance, and this threshold seems to change with experience (Barlow et al., 2004).

Abnormal Interpretation Rate

The abnormal interpretation rate is a measure of the number of women whose mammogram interpretation leads to additional imaging or biopsy. For screening mammography, the term “recall rate” is often used. The recall rate is the proportion of all women undergoing screening mammography who are given a positive interpretation that requires additional examinations (Category 0 [minus the exams for which only comparison with outside films is requested], 4, or 5). Desirable goals for recall rates for highly skilled interpreting physicians were set at less than or equal to 10 percent in the early 1990s (Bassett et al., 1994). This measure is easy to calculate because it does not rely on establishing the cancer status of women. The disadvantage is that differences in this measure may not reflect differences in skill except when the rate is extraordinarily high or low. Again, this will depend on the proportion of prevalent to incident screens (Frankel et al., 1995), on the availability of previous films for comparison (Kan et al., 2000), and on the mode of interpretation (Sickles, 1992, 1995a; Ghate et al., 2005).
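Because the recall rate needs only the assessments themselves, it can be computed directly from a list of screening interpretations. The sketch below follows the definition given above, including the exclusion of Category 0 examinations for which only comparison with outside films is requested; the record layout and the `comparison_only` flag are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable

# Goal from the early 1990s cited above (Bassett et al., 1994): at most 10 percent.
RECALL_GOAL = 0.10


@dataclass
class ScreeningExam:
    category: int                  # BI-RADS assessment category
    comparison_only: bool = False  # Category 0 issued only to obtain outside films


def is_recall(exam: ScreeningExam) -> bool:
    """True if the interpretation counts as a recall under the definition above."""
    if exam.category in (4, 5):
        return True
    return exam.category == 0 and not exam.comparison_only


def recall_rate(exams: Iterable[ScreeningExam]) -> float:
    """Proportion of screening examinations that count as recalls."""
    exams = list(exams)
    if not exams:
        raise ValueError("no examinations supplied")
    return sum(is_recall(e) for e in exams) / len(exams)
```

For a hypothetical series of 1,000 screening examinations with 80 Category 0 recalls, 15 Category 0 requests for outside films only, and 5 Category 4 assessments, the recall rate is 8.5 percent, which falls within the stated goal.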

Cancer Staging

Cancer staging is performed after a breast cancer is diagnosed. Stage, along with other tumor prognostic indicators (e.g., tumor grade, hormone receptor status, and other factors), is used to determine the patient’s prognosis, and the combination of tumor markers and stage influences treatment. Cancer staging takes into account information regarding the tumor histological type and size, as well as regional lymph node status and distant metastases. Staging information, which is generally derived from pathology reports in varying forms, is useful for the mammography audit because women with advanced, metastatic tumors are more likely to die from the disease. However, tumor staging information is not always easily available to the imaging facility, and thus, may be more of a burden to acquire.

Tumor Size

The size of the breast cancer at the time of diagnosis is relevant only for invasive cancers. All patients with only DCIS are Stage 0, regardless of the extent of the DCIS. An interpreting physician who routinely detects smaller invasive tumors is likely to be more skilled at identifying small abnormalities in a mammogram. The proportion of invasive tumors smaller than 1.5 or 1.0 cm could be used as one measure.

Using tumor size as a performance measure has several limitations; measurement of a tumor is an inexact science and may vary depending on what is recorded in a patient record or tumor registry (e.g., clinical size based on palpation, size based on imaging, size based on pathology), and who is doing the measuring. SEER (Surveillance, Epidemiology and End Results) registries use a hierarchy to choose which measurement to include. Heterogeneity will occur because not all measurements are available. Furthermore, the proportion of small tumors will be affected by the population of tumors seen by a given interpreting physician; for example, a physician reading more prevalent screens will have a greater proportion of large tumors because there are more large tumors in the population. The screening interval is also important when tumor size is used as a performance measure.

A shift toward smaller tumor size has been noted in screened populations such as those in the Swedish randomized trials of mammography (Tabar et al., 1992). A similar shift is expected in other screened populations. In one study of a national screening program, invasive breast cancer tumor size at the time of discovery decreased from 2.1–2.4 cm to 1.1–1.4 cm between 1983 and 1997, the period during which the program was implemented (Scheiden et al., 2001).

Axillary Lymph Node Status

The presence or absence of cancer cells in the axillary lymph nodes is one of the most important predictors of patient outcome. Relative to women with histologically negative lymph nodes, the prognosis worsens with each positive node (one containing cancer cells). Node positivity, however, is not necessarily a useful surrogate measure of an interpreting physician’s interpretive performance because inherently aggressive tumors may metastasize to the axillary lymph nodes early, when the tumor is still small, or even before the tumor becomes visible on a mammogram.

Area Under the Receiver Operating Characteristic Curve² (AUC)

Interpreting physicians face a difficult challenge. While trying to find cancer they must also try to limit the number of false-positive interpretations. If the distributions of interpretations among women with breast cancer and women without breast cancer were graphed together on one set of axes, they would look like Figure 2–1. Focusing on sensitivity simply indicates how an interpreting physician operates when cancer is present. Focusing on specificity simply indicates how an interpreting physician operates when cancer is not present. What is really needed to assess performance is the ability of the interpreting physician to simultaneously discriminate between women with and without cancer. This is reflected in the overlap between the two distributions of interpretations in Figure 2–1, and is measured by the area under the receiver operating characteristic (ROC) curve (AUC) (Figure 2–2).

FIGURE 2–1 Ideal (A) and actual common (B) distribution of mammography interpretation (BI-RADS Assessment Categories 1–5).

² For a more detailed description of ROC curves, see Appendix C in Saving Women’s Lives (IOM, 2005).

ROC analysis was developed as a methodology to quantify the ability to correctly distinguish signals of interest from the background noise in the system. The ROC curves map the effects of varying decision thresholds and demonstrate the relationship between the true-positive rate (sensitivity) and the false-positive rate (1 − specificity). If a reader’s interpretation is no better than a flip of the coin, the distributions of BI-RADS assessments for women with and without cancer in Figure 2–1 will overlap completely and the AUC in Figure 2–2 will be 0.5. If an interpreting physician has complete discrimination, the distributions of BI-RADS assessments will be completely separated for women with and without cancer, as in Figure 2–1a, and the AUC will be 1.0. An interpreting physician’s AUC therefore usually falls between 0.5 and 1.0.

Estimating the AUC is possible if the status of all examined women is known and the appropriate computer software is employed. It has the advantage of reflecting the discriminatory ability of the interpreting physician and incorporates both sensitivity and specificity into a single measure, accounting for the trade-offs between the two measures.
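As a rough illustration of how an AUC can be estimated once the cancer status of every examined woman is known, the following minimal sketch computes a nonparametric (rank-based) AUC from ordinal BI-RADS assessments. It is not the latent-distribution approach used in the studies cited in this chapter. The function name `empirical_auc`, the example data, and the use of the chapter's ordinal ordering of categories (1, 2, 3, 0, 4, 5) as a suspicion ranking are assumptions made for illustration only.

```python
# Ordinal ranking of BI-RADS assessments from least to most suspicious,
# following the order given in this chapter: 1, 2, 3, 0, 4, 5.
BIRADS_ORDER = [1, 2, 3, 0, 4, 5]
RANK = {category: rank for rank, category in enumerate(BIRADS_ORDER)}


def empirical_auc(cancer_ratings, no_cancer_ratings):
    """Nonparametric AUC: the probability that a randomly chosen woman with
    cancer received a more suspicious assessment than a randomly chosen
    woman without cancer, counting ties as one half."""
    if not cancer_ratings or not no_cancer_ratings:
        raise ValueError("need at least one case in each group")
    score = 0.0
    for c in cancer_ratings:
        for n in no_cancer_ratings:
            if RANK[c] > RANK[n]:
                score += 1.0
            elif RANK[c] == RANK[n]:
                score += 0.5
    return score / (len(cancer_ratings) * len(no_cancer_ratings))


# Hypothetical assessments, for illustration only.
separated = empirical_auc([4, 5, 5], [1, 1, 2])    # complete separation
overlapping = empirical_auc([1, 2, 0], [1, 2, 0])  # identical distributions
print(separated, overlapping)  # 1.0 0.5
```

The two hypothetical readers reproduce the endpoints described above: complete separation gives 1.0 and complete overlap gives 0.5.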

FIGURE 2–2 ROC analysis. If a reader is guessing between two choices (cancer versus no cancer), the fraction of true positives will tend to equal the fraction of false positives. Thus, the resulting ROC curve would be at a 45-degree angle and the area under the curve, 0.5, represents the 50 percent accuracy of the test. In contrast, the ROC curve for a reader with 100 percent accuracy will follow the y-axis at a false-positive fraction of zero (no false positives) and travel along the top of the plot area at a true-positive fraction of one (all true positives). The area under the curve, 1.0, represents the 100 percent accuracy of the test. The hypothetical result for a reader with an area under the curve of 0.85 is shown for comparison.

The disadvantages include the challenges of data collection and the requirement for somewhat sophisticated software to estimate the value of the AUC. Of note, however, is that ROC curves may be problematic when using BI-RADS terminology if interpreting physicians do not accurately use the full range of values in the ordinal BI-RADS scale (1, 2, 3, 0, 4, 5). Even when providers use the full scale accurately, the interpretations do not fall into a normal distribution across the range of interpretations. Most screening interpretations (79 percent) are BI-RADS 1. Despite this, BI-RADS interpretations can be analyzed directly with models for ordinal-level data (Tosteson and Begg, 1988). An underlying latent distribution can be assumed to generate a continuous ROC curve and area under the curve. Doing so requires assuming that the latent distributions are normal, with different standard deviations for the women with and without cancer. Using widely available software, these assumptions can be accommodated and ROC analysis is routinely performed (Tosteson and Begg, 1988; Barlow et al., 2004; Yankaskas et al., 2005).

In summary, there is currently no perfect measure of performance, even under the best of circumstances where all the necessary data are collected. In practice, such a situation rarely exists. In addition, appropriate benchmarks for screening may vary depending on the unique populations served by a particular facility. Measuring and assessing performance in practice therefore constitutes a considerable challenge if the goal is accurate comparisons between facilities. If the goal is consistent feedback to the interpreting physicians within a facility, the limitations are not so great, because the data challenges may be more consistent within facilities and therefore the measurements more comparable. Given the challenges and limitations, the Committee recommends a focus on PPV₂. Calculating the cancer detection rate and the rate of abnormal interpretation (women whose mammogram interpretation leads to a recommendation for additional imaging or biopsy) would facilitate appropriate interpretation of PPV₂, which is influenced by the prevalence of cancer in the screening population. Evaluating these three measures in combination would enhance the current required medical audit of mammography considerably and should be feasible for mammography facilities to achieve. Measures such as sensitivity and specificity would be even more useful, but it would not be feasible to calculate these measures in community practices that do not have linkage with a tumor registry. Suggested changes to the medical audit of mammography are described in more detail in the section entitled Strategies to Improve Medical Audit of Mammography.
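The three measures recommended above can be computed from simple audit counts. The sketch below is illustrative only. The function name, argument names, and counts are hypothetical. PPV₂ is taken here to be the fraction of women recommended for biopsy who are diagnosed with cancer, which is an assumption of this example rather than a definition given in this section.

```python
def audit_measures(n_screened, n_abnormal, n_biopsy_recommended,
                   n_cancer_after_biopsy_rec, n_cancers_detected):
    """Return the three audit measures discussed above as a dictionary.

    Assumed definitions (hypothetical, for illustration):
      abnormal interpretation rate = abnormal interpretations / women screened
      cancer detection rate        = cancers detected per 1,000 women screened
      PPV2                         = cancers among women recommended for biopsy
                                     / women recommended for biopsy
    """
    if n_screened <= 0 or n_biopsy_recommended <= 0:
        raise ValueError("denominators must be positive")
    return {
        "abnormal_interpretation_rate": n_abnormal / n_screened,
        "cancer_detection_rate_per_1000": 1000 * n_cancers_detected / n_screened,
        "ppv2": n_cancer_after_biopsy_rec / n_biopsy_recommended,
    }


# Hypothetical facility counts, not taken from any study cited here.
measures = audit_measures(
    n_screened=10_000,
    n_abnormal=1_000,
    n_biopsy_recommended=150,
    n_cancer_after_biopsy_rec=45,
    n_cancers_detected=48,
)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Reporting the three values together matters because the same PPV₂ can arise from very different combinations of recall behavior and cancer prevalence.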

FACTORS AFFECTING INTERPRETIVE PERFORMANCE OF BOTH SCREENING AND DIAGNOSTIC MAMMOGRAPHY

Despite evidence that mammography screening is an efficacious technology for reducing breast cancer mortality among women in certain age groups (Andersson et al., 1988; Shapiro et al., 1988; Roberts et al., 1990; Frisell et al., 1991; Tabar et al., 1992; Elwood et al., 1993; Nystrom et al., 1993; Fletcher et al., 1993; Bassett et al., 1994; Schwartz et al., 2000; Fletcher and Elmore, 2003), its full potential for mortality reduction in practice may be limited by the accuracy of interpretation. For example, a low sensitivity may indicate missed opportunities in diagnosing early-stage breast cancers, when the potential to save lives is highest. On the other hand, a low specificity may indicate high rates of mammograms interpreted as abnormal, requiring additional workup when the woman actually does not have breast cancer.

TABLE 2–3 Recent Reports of Measures on Interpretive Performance of Screening and Diagnostic Mammography

Authors	Exam Type	Population	Years Studied	Sensitivity	Specificity	Cancer Detection Rate
Carney et al. (2003)	Screening	National sample (n=329,495)	1996–1998	75.0%	92.3%	4.8/1,000 (adjusted)^b
Kerlikowske et al. (2000)	Screening	National sample (n=389,533)	1985–1997	80.9%	—	4.2/1,000 (unadjusted)
Poplack et al. (2000)	Screening	New Hampshire (NH) women (n=47,651)	1996–1997	72.4%	97.3%	3.3/1,000 (unadjusted)
Poplack et al. (2000)	Diagnostic	NH women (n=47,651)	1996–1997	78.1%	89.3%	—
Yankaskas et al. (2005)	Screening	National sample (n=680,641)	1996–2000	70.9–88.6%^a	85.9–93.3%^a	3.2–6.1/1,000^a
Author	Exam Type	Population	Years Studied	PPV₂	Tumor Size	Cancer Diagnosis Rate
Sickles et al. (in press)	Diagnostic	National sample (n=332,926)	1996–2000	31.5%	20.2 mm	25.3/1,000 (unadjusted)
^a Depending on months since prior mammogram (9–15, 16–20, 21–27, 28+).
^b Adjusted for patient characteristics in the screening population studied.

The Committee was not asked to assess the current quality of mammography interpretation in the United States, but the available evidence indicates that interpretive performance is highly variable. There is a range in reported performance indices for mammography. Sensitivity and specificity rates for mammography screening trials range from 75 percent to 95 percent and from 83 percent to 98.5 percent, respectively (Roberts et al., 1990; Frisell et al., 1991; Tabar et al., 1992; Nystrom et al., 1993; Elmore et al., 2005). Table 2–3 lists the most current information on interpretive performance in screening and diagnostic mammography. Different indices for performance are used relative to the type of studies done.

Variability is common in areas of medicine where observation and interpretation are subjective (Feinstein, 1985; Elmore and Feinstein, 1992). Several studies on variability in interpretive performance of mammography have been conducted with radiologists both in test situations (Elmore et al., 1994; Beam et al., 1996; Elmore et al., 1997; Kerlikowske et al., 1998) and in medical practice (Meyer et al., 1990; Brown et al., 1995; Kan et al., 2000; Yankaskas et al., 2001; Sickles et al., 2002; Elmore et al., 2002; Smith-Bindman et al., 2005). These have revealed that recall rates (the proportion of screening mammograms interpreted as abnormal with additional evaluation recommended) range from 3 percent to 57 percent among facilities (Brown et al., 1995) and 2 percent to 13 percent among individual radiologists (Yankaskas et al., 2001). Recall rates are higher and false-positive mammograms are more common in the United States than in other countries, although the cancer detection rates are similar (Elmore et al., 2003; Smith-Bindman et al., 2003; Yankaskas et al., 2004). Less research has focused on the performance of diagnostic mammography, though one recent paper reported on women with signs or symptoms of breast cancer (Barlow et al., 2002). A PPV of 21.8 percent, sensitivity of 85.8 percent, and specificity of 87.7 percent were found.

Although general guidelines for performance have been put forth previously (Bassett et al., 1994), there is no consensus in the United States on minimal performance standards for interpretation, in part because there tends to be a trade-off between sensitivity and specificity, and there is no agreement on how many false positives should be acceptable in order to maximize sensitivity. In addition, the optimal performance standards will vary depending on a variety of factors such as the patient population being served. Patient factors that affect test accuracy include the size of the lesion, characteristics of the breast under examination (e.g., breast density, previous breast biopsies), patient age, extent of follow-up required to detect cancer, existence of previous exams, availability of prior films for comparison (Steinberg et al., 1991; Saftlas et al., 1991; Reis et al., 1994; Laya et al., 1996; Litherland et al., 1997; Persson et al., 1997; Pankow et al., 1997; Byrne, 1997; Porter et al., 1999; Mandelson et al., 2000; Buist et al., 2004), and time interval between screening examinations (White et al., 2004; Yankaskas et al., 2005).

Interpretive Volume and Interpreting Physicians’ Levels of Experience

Interpretive volume and interpreting physicians’ levels of experience (length of time interpreting mammography) have also been identified as important factors affecting breast cancer detection (Sickles, 1995a; Elmore et al., 1998; Beam et al., 2002; Esserman et al., 2002). Interpretive volume has recently received a great deal of attention, and it appears that when used in conjunction with other quality improvement strategies, higher volume may enhance interpretive accuracy. The findings and limitations of the several research studies discussed below are summarized in Table 2–4.

Perry (2003) described the UK National Health Program, in which there are minimum volume requirements that are much higher than in the United States: 5,000 mammograms interpreted per year per interpreting physician, and 9,000 screening mammograms performed per year per facility. Radiologists undertake a 2-week multidisciplinary course with specialist training at high-volume screening sites, which includes three sessions per week of interpreting screening mammograms. Radiologists additionally attend routine breast disease-related meetings and receive personal and group audit reports that include data on cancer detection rate, recall rate, and PPV₂. With all these combined activities, performance indices showed a reduction in the recall rate from 7 to 4 percent, and an increase in the small invasive cancer detection rate from 1.6/1,000 to 2.5/1,000.

Kan et al. (2000) studied 35 radiologists in British Columbia (BC), Canada, who work in the BC Mammography Screening Program. They derived a standardized abnormal interpretation ratio by dividing observed counts of the event by expected counts of the event. They found that the abnormal interpretation ratio was better for readers of 2,000–2,999 and 3,000–3,999 mammograms per year than for those interpreting fewer than 2,000 per year. These researchers concluded that a minimum of 2,500 interpretations per year is associated with lower abnormal interpretation rates and average or better cancer detection rates. Whether the findings from this small sample from a program in Canada, where the qualifying standards for interpreting physicians are quite different, can be generalized to practice in the United States is not clear.
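A standardized ratio of this kind can be sketched as observed counts divided by the counts expected if a reader had interpreted at reference rates. The example below is a minimal illustration under stated assumptions: the stratification by screening round, the reference rates, and all names are hypothetical, and how Kan et al. actually derived their expected counts is not described in this chapter.

```python
def standardized_ratio(observed, reference_rates, reader_counts):
    """Observed events divided by expected events.

    Expected events are the sum, over strata, of the reference rate for the
    stratum times the number of mammograms the reader interpreted in it.
    A ratio below 1 means fewer events than expected; above 1 means more.
    """
    expected = sum(reference_rates[s] * reader_counts[s] for s in reader_counts)
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return observed / expected


# Hypothetical reference abnormal-interpretation rates by screening round.
reference_rates = {"first_screen": 0.10, "subsequent_screen": 0.05}
# Hypothetical reader workload and observed abnormal interpretations.
reader_counts = {"first_screen": 500, "subsequent_screen": 2000}
ratio = standardized_ratio(120, reference_rates, reader_counts)
print(round(ratio, 2))
```

With these made-up numbers the reader is expected to produce 150 abnormal interpretations and produces 120, so the ratio is below 1.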

Another recent study from a population-based breast cancer screening program in Quebec showed that the rate of breast cancer detection was unrelated to the radiologist’s interpretive volume, but increased with the facility’s screening volume (Theberge et al., 2005).

A recent study that aimed to examine the relationship between reader volume and accuracy in the United States suggested that high volume readers performed better (Esserman et al., 2002). However, the study methodology included some artificial elements (e.g., it held specificity at a steady state and then recalculated each physician’s sensitivity, rather than studying the interpretive trade-offs between the two measures) that weaken the strength of the findings and conclusions.

In another study, performed within a major U.S. health maintenance organization (HMO) (Adcock, 2004), radiologists were provided with personal and group audit reports, attended case review sessions, participated in a self-assessment program, and were required to interpret 8,000 mammograms per year per radiologist (n=21 radiologists). The author reported that sensitivity improved from 70 percent to 80 percent, with a mean cancer detection rate of 7.5/1,000 and a mean recall rate of 7 percent, two other indices that improved significantly. However, the analysis was not published in a peer-reviewed journal; the report was primarily descriptive and is analytically limited (confidence intervals were not calculated), which may influence its accuracy. In addition, it is hard to know whether findings from 21 radiologists within a single HMO-based health care setting can be generalized to U.S. interpreting physicians in other diverse practice settings.

TABLE 2–4 Summary of Recent Studies That Examine the Impact of Interpretive Volume and Experience on Accuracy

Author	Intervention/Evaluation/Volume Level	Population	Measures of Improvement
Author	Intervention/Evaluation/Volume Level	Population	Cancer Detection Rate	Recall Rate	Biopsy Info	Sens/Spec/AUC	Analytic Considerations
Perry (2003)	Audit with feedback, self-assessment program, and specialty training program with volume of 5,000/year	UK national sample (n=1,461,517)	Small invasive: 4.6/1,000^a Noninvasive: 0.5/1,000	4.0%	Benign Biopsy Rate 0.8/1,000	—	Significant improvements Limited generalizability Cannot isolate effect of volume
Adcock (2004)	Audit with feedback and case review with self-assessment program and volume of 8,000/year	Kaiser patients (n=101,000), 21 radiologists	From 6.3 to 7.5/1,000—all cancers combined	From 7.0% to 7.5%	PPV₂ from 31 to 37	Sens: from 70% to 80%	Limited number of radiologists (n=21) Cannot isolate effect of volume
Beam et al. (2003)	Examined the influence of volume after adjusting for other factors using multiple regression analysis	Random sample of 110 radiologists assessed using test set of 148 cases	—	—	—	On test set, mean Sens is 91% and Spec is 62%; neither volume nor years interpreting were found to influence accuracy	Case set not representative of usual practice (43% were cancers) Self-reported volume used as an input rather than studied with actual performance data

Barlow et al. (2004)	Assessed the relationship between radiology characteristics (years interpreting and volume interpreted) to actual performance	National sample (n=469,512 women) (n=124 radiologists)	—	10.4%	—	Adjusted AUC for number of mammograms interpreted: 0.92 for <1,000; 0.92 for 1,001–2,000; and 0.92 for 2,000+ (p=0.94)	Used actual practice data Higher volume was related to higher recall and Sens but lower Spec; both volume and years of experience affected criterion for calling mammo positive, but neither affected overall accuracy
Smith-Bindman et al. (2005)	Identified practice patterns and physician characteristics associated with the accuracy of screening mammography	National sample (n=1,220,046 women) (n=209 radiologists)	—	—	—	Mean Sens: 77%; mean false + rate 10%; interpretation of 2,500–4,000 per year with high screening focus had 50% fewer false+exams	Used actual practice data and included a focus on screening Included novel metric of accuracy

Kan et al. (2000)	Determined the relationship between annual screening volume and radiologists’ performance (in BC, Canada)	35 radiologists	—	—	—	Standardized abnormal interpretation was better for readers of 2,000–2,999 and 3,000–3,999 per year than those less than 2,000	Derived a standardized abnormal interpretation ratio by dividing observed counts of the event by expected counts of the event
^a Incident round screening performance.
NOTE: AUC=area under the receiver operating curve, Sens=sensitivity, Spec=specificity, PPV=positive predictive value.

The above studies are important, but notable limitations exist regarding the study of volume or experience alone because other confounding factors were included in the interventions. For example, in the Perry study the specific contribution of the higher minimum interpretive volume requirement was not isolated from other program activities in the analysis of improved performance. The same is true for Adcock’s study, where the specific contribution of interpretive volume versus other aspects of the intervention is unknown.

Some studies have been conducted in the United States that do examine interpretive volume alone, or in some cases, examine volume along with continuous experience. These are described below. Beam and colleagues (2003) used a test set of 148 mammograms, with 43 percent of the films having cancer, which was circulated to 110 randomly selected U.S. radiologists to assess the relationships between interpretive volume and accuracy. These researchers employed two different measures of accuracy, both using ROC analysis. The first was the area under the curve (AUC) estimated nonparametrically. This measure can be interpreted as the ability of the diagnostician to discriminate a mammogram showing breast cancer from one not showing breast cancer when two such mammograms have been randomly selected and presented together. Beam asserts the total AUC may not reflect the actual operating characteristics of radiologists because the full AUC includes high false-positive rates that are not relevant for screening. As a result, Beam has employed the use of partial AUC by restricting his analysis to the interval in which false-positive probability is less than 10 percent. This can be interpreted as the average sensitivity for the radiologist who reads within a clinically desirable range of false-positive values.
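The idea of a partial AUC restricted to false-positive probabilities below 10 percent can be sketched as follows. This is a minimal illustration, not the procedure used by Beam and colleagues: the empirical ROC construction, the trapezoidal rule with linear interpolation at the cutoff, the integer suspicion scores, and all function names are assumptions of this example.

```python
def roc_points(cancer_scores, no_cancer_scores):
    """Empirical ROC points (false-positive rate, true-positive rate),
    ordered from (0, 0) to (1, 1). Higher scores mean more suspicious."""
    thresholds = sorted(set(cancer_scores) | set(no_cancer_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in cancer_scores) / len(cancer_scores)
        fpr = sum(s >= t for s in no_cancer_scores) / len(no_cancer_scores)
        points.append((fpr, tpr))
    return points


def partial_auc(points, max_fpr=0.10):
    """Trapezoidal area under the ROC curve for false-positive rates
    between 0 and max_fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Clip the final segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical suspicion scores: a perfectly discriminating reader
# and a reader whose scores carry no information.
perfect = partial_auc(roc_points([5, 5], [1, 1]))
guessing = partial_auc(roc_points([1, 1], [1, 1]))
# Dividing by the width of the interval gives an average sensitivity.
print(perfect / 0.10, guessing / 0.10)
```

Dividing the partial area by the width of the interval gives the "average sensitivity" reading described above; for the uninformative reader it is half the cutoff, because sensitivity equals the false-positive rate along the diagonal.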

Briefly, they found that, after controlling for the influence of radiologist- and facility-level factors, neither interpretive volume nor years interpreting mammography was associated with screening accuracy. Rather, years since residency and having formal training in mammography during residency were both negatively associated with both of their ROC-based measures of accuracy, as described above. Several other factors were associated with one of the accuracy measures. Being the owner of the practice, increased use of diagnostic imaging and interventional procedures, and double reading were associated with increased accuracy, while presence of a computerized system to monitor and track screening, facility classification as hospital-based radiology department or multispecialty medical clinic (compared to breast diagnostic center, mammography screening center), and presence of a formal pathology correlation conference were negatively associated with accuracy. However, the Committee is not comfortable drawing conclusions about volume based on these findings alone. Because test sets have an extremely high percentage of abnormal films compared to usual clinical practice, data from “test” situations may be unreliable (Rutter and Taplin, 2000), although some work has suggested that giving specific instructions to reviewers prevents context bias in interpretive studies where images do not represent actual practice (Egglin and Feinstein, 1996).

Unfortunately two studies using data from clinical practice in similar populations in the National Cancer Institute’s (NCI’s) Breast Cancer Surveillance Consortium appear to show conflicting findings. In one, Barlow and colleagues (2004) studied 124 radiologists in 3 regions of the United States and found that increased radiologist experience was associated with a reduced recall rate and lower sensitivity but higher specificity. Using ROC curves to account for the trade-off between sensitivity and specificity, with additional adjustments for patient characteristics, these researchers found that both interpretive volume and extent of interpretive experience affected radiologists’ criteria for calling a mammogram positive, but overall accuracy of interpretation was not affected by either of these factors (interpretive volume and experience). These researchers concluded that direct feedback with audit results and focused training may result in more improved performance than increased volume or experience.

In the second study, Smith-Bindman and colleagues (2005) studied 209 radiologists and found an overall sensitivity of 77 percent (range 29 percent to 97 percent) and a false-positive rate of 10 percent (SD 5 percent; range 1–29 percent). They also found that more experience as a radiologist (25 years or more versus 10 years) was associated with lower false-positive rates, and an interpretive volume of 2,500–4,000 versus 481–750 was associated with a shift to a more accurate ROC curve after adjustment for both patient and radiologist characteristics. Using this technique, these researchers concluded that an annual interpretive volume of 3,000 screening mammograms per radiologist translated into 182 fewer false-positive mammograms and one missed cancer per year, though it does not show a significant improvement in their measure of accuracy (e.g., a new ROC curve). In fact, one table in the Smith-Bindman paper does indicate that overall accuracy is not influenced by volume with an odds ratio of 1.06, a finding that is not highlighted in the discussion of their results. No difference was shown in the odds of having a new ROC curve across the levels of volume. Thus, the Committee concludes that a recommendation to increase volume requirements cannot be justified based on this study. Smith-Bindman and colleagues’ modeling effort is innovative and intriguing, but its validity is not widely accepted.

The analytic methods used in the Smith-Bindman paper differ significantly from those used by Barlow et al., though the data sources are very similar and to some extent overlapping. Based on discussions with these investigators and a neutral biostatistician, Anna Tosteson, Sc.D., who is an expert in this field, the Committee concludes that there were reasonable arguments for each analytic technique, but that regardless of which method was chosen, neither showed a significant influence of volume on overall accuracy. More study is needed to establish the implications, advantages, and disadvantages of statistical approaches to evaluating the influence of volume on interpretive performance.

Factors that should be taken into account in reviewing often conflicting results of these studies include the type of analysis—ROC is stronger than sensitivity or specificity alone because the trade-offs between these two measures are accounted for and adjustments can be made for both patient and interpreting physician characteristics. Other factors to be considered include test versus practice-based evaluations, interpreting physician training and subspecialization (e.g., breast specialist versus general radiologist), and context of the reading. Contextual factors include whether all cancer data are ascertained, and the practice environment in which interpretation is taking place. For example, it is clear that practices in the United Kingdom vary substantially from those in the United States (Smith-Bindman et al., 2003).

Finally, the effect of changes in minimal reader volume on access to mammography services must be considered along with the potential effects on reader performance. As noted in more detail in Chapter 4, results from the recent ACR Survey of Radiologists of diagnostic radiologists suggest that raising the minimum reader volume to 1,000 every year would affect about 4,000 radiologists (25 percent of all practitioners), who accounted for approximately 6 percent of all mammograms interpreted in 2003. If the minimum limit of mammograms read were to increase to 2,000 every year, it would affect about 8,700 radiologists (54 percent of all practitioners), who accounted for approximately 23 percent of all mammograms interpreted in 2003.

In summary, a variety of approaches appear to offer benefits in improving physicians’ performance in interpreting screening mammograms, but investigators have not been able to demonstrate a clear relationship between volume alone and accuracy, or experience alone and accuracy. This finding is consistent with a report published by the IOM’s National Cancer Policy Board, which determined that a higher volume of care translates into improved short-term outcomes for certain complex treatments for cancer. However, the Board did not have evidence to support a broader application of volume recommendations to more common cancer treatments (IOM, 2001b).

The Committee discussed the potential impact of a modest increase in interpretive volume to 1,000 per year, and concluded that this increase alone was unlikely to change interpretive performance or to facilitate the ability of interpreting physicians to self-assess true-positive or false-negative interpretations. The requirement of 960 films/2 years was originally chosen with the intent of maximizing access, in the absence of any data to guide selection of a particular number. Given the uncertainty regarding the effect of reader volume alone, maintaining access should continue to be of primary concern because increasing the minimal reader volume could create access problems in some areas. Again, a combination of factors, most likely including helpful feedback, may be more effective in improving accuracy than volume alone.

Medicolegal Issues

There is some concern that medicolegal issues could also influence radiologists’ behavior. Failure or delay in breast cancer diagnosis continues to be the leading cause of medical malpractice claims in the United States (Physician Insurers Association of America, 2002; Berlin, 2003) with the amount of indemnity payments for breast cancer-related claims increasing significantly in the past decade (Records, 1995; Physician Insurers Association of America, 2002). However, whether malpractice concerns are driving the recall rate up in the United States has not been determined definitively.

A recent cross-sectional study conducted by Elmore et al. (in press) found that 72 percent of radiologists believed their concern about malpractice claims moderately or greatly increased their recall rate (recommendations for diagnostic mammograms and ultrasounds), while no radiologists responded that malpractice concerns decreased their recall rate. More than half (59 percent) also believed their concern about medical malpractice moderately or greatly increased their recommendations for breast biopsies, while no radiologists reported a decrease in recommendations for breast biopsies due to malpractice concerns. Though recall rates of the individual radiologists ranged from 1.8 percent to 26.2 percent, no statistically significant associations were noted between recall rates and reports of prior medical malpractice claims or other malpractice variables, perhaps because concern about malpractice was so uniformly high among the radiologists. The number of radiologists in the study with mammography-related malpractice claims during the 1996 to 2001 interpretation period was small (n=18), and the legal process for these claims often occurred over a long time period. Therefore, this study was not able to discern a direct effect of individual claims on recall rate.

DOUBLE-READING METHODS AND TECHNICAL TOOLS DESIGNED TO IMPROVE PERFORMANCE

Double Reading

One approach to improving interpretive performance is double reading. This approach may take several forms, but the two extremes are: (1) independent double reading, where both readers interpret the films without knowledge of the other’s assessment and the most abnormal reading is acted upon, and (2) consensus double reading, where both learn the other’s interpretation and resolve the differences together (arbitration). Between these two extremes are many blended forms where interpreting physicians may know each other’s interpretations and discuss differences, differences are resolved by a third party, or the second reader makes the final assessment. At least half of the organized programs in continental Europe and 88 percent of programs in the United Kingdom use double reading in some form, but in the United States the rate is lower (Shapiro et al., 1988). A recent study of community-based mammography practices showed that half (51 percent) of the surveyed screening facilities perform some type of double interpretation of screening mammograms; only 11 percent of the surveyed screening facilities perform double interpretations of all screening mammograms (Hendrick et al., 2005).

Research indicates that two individual interpretations (rather than one) capture a small but not insignificant number of breast cancers (6–15 percent) missed on single interpretation (Anttinen et al., 1993; Thurfjell et al., 1994; Hendee et al., 1999). However, some studies indicate that increased sensitivity may be accompanied by decreased specificity. In a review of 10 cohort studies of double reading, Dinnes et al. concluded that double reading increases cancer detection by 3–11/10,000 women screened and that recall may actually decrease if consensus arbitration is used (Dinnes et al., 2001). The issue of arbitration is important because acting on the most abnormal interpretation increased recall from 38 to 149/10,000 women. A study of arbitration by a panel of three radiologists who each independently read mammograms in cases where the two radiologists could not come to agreement increased recalls slightly, but still missed some cancers (Duijm et al., 2004). No studies have examined the effect of double reading on the interpretations of interpreting physicians over time, or on subsequent breast cancer mortality. Double reading increases the cost per cancer detected by £1,162 to £2,221 (approximately $2,185 to $4,177) (Dinnes et al., 2001). It also increases workforce needs. Double reading is not reimbursed by third-party payers.

Computer-Aided Detection (CAD)

Computer-aided detection (CAD) is another method used to supplement a single reader’s interpretation of screening mammograms. CAD can be performed on either standard film (analog) images or digitally acquired mammograms. CAD on analog images requires passing the films through a machine that creates a digital version of the images. The digital information is then analyzed by computer software that recreates the image on a monitor and flags areas of concern (e.g., clustered microcalcifications and masses) (Warren-Burhenne et al., 2000; Brem and Schoonjans, 2001). The interpreting physician reads the original films and then looks at an annotated copy of the digitized image. CAD is more likely to mark calcified lesions compared to masses and architectural distortions (Baker et al., 2003; Taplin et al., submitted). Most studies have counted CAD as true positive even if the algorithm marked a finding only on one of the two standard mammographic views. FDA approved the first CAD software in 1998 based on work demonstrating it would mark abnormalities not identified by radiologists (Warren-Burhenne et al., 2000) and it is now being used around the country.

It is important to note that cancers account for less than 1 percent of findings marked by CAD (Freer and Ulissey, 2001). It is up to the interpreting physician to determine if the markings represent actionable findings, and thus, the interpreting physician will routinely disregard many findings. The proper study of CAD, therefore, does not test whether a given lesion is marked by CAD, but rather, whether a given interpreting physician decided to ignore or act on the CAD mark.

Unfortunately, the two published studies of CAD outside a test setting present somewhat conflicting results (reviewed by Elmore et al., 2005). Freer and Ulissey (2001) reported an increase of approximately 20 percent in cancer detection rate with CAD versus without it. However, the study was done using two radiologists whose characteristics and experience were not reported. Lesions that were judged to require additional evaluation (recall) only because they were marked by CAD were classified as additional detections. The radiologists could only add workups for lesions marked by CAD, and had to act on their own calls even if CAD did not mark the lesion. Although that is the recommended way to use CAD, evidence from a test setting (not actual clinical practice) suggests that radiologists may not act on their own findings if CAD does not mark the lesion (Taplin et al., submitted). In the second published study of CAD in clinical practice, Gur and colleagues (2004) found no overall difference in cancer detection rates among breast imaging specialists in academic practice (cancer detection rate of 3.49/1,000 without CAD versus 3.55/1,000 with CAD, p=0.68). However, the subset of studied radiologists who interpret a relatively low caseload did increase their cancer detection rate by approximately 20 percent (3.05/1,000 without CAD versus 3.65/1,000 with CAD; p=0.37), similar to the result reported by Freer and Ulissey (Freer and Ulissey, 2001; Feig et al., 2004). More information is needed about CAD in practice, with special attention to how such factors as interpreting physician experience, lesion characteristics, practice settings, and specific CAD algorithms affect CAD performance, before it can be concluded that it will generally improve interpretation. Studies performed in a test setting should be undertaken with a standard set of cases that were not used to train the various CAD systems being tested.
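The relative changes implied by the detection rates quoted from Gur and colleagues (2004) can be checked with simple arithmetic. The sketch below is illustrative only: the function name `relative_change` is hypothetical, and the calculation says nothing about statistical significance, which the p-values quoted in the text address.

```python
def relative_change(before: float, after: float) -> float:
    """Return the fractional change from `before` to `after`."""
    if before <= 0:
        raise ValueError("baseline rate must be positive")
    return (after - before) / before


# Cancer detection rates per 1,000 screens, as quoted in the text.
all_readers = relative_change(3.49, 3.55)   # without CAD vs. with CAD
low_volume = relative_change(3.05, 3.65)    # low-caseload subset

print(f"all readers: {all_readers:.1%}")
print(f"low-volume readers: {low_volume:.1%}")
```

The low-volume figure works out to roughly the 20 percent cited, while the overall change is under 2 percent.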

CAD is reimbursed by third-party payers. Adding CAD into clinical practice is not likely to substantially increase the workload of the interpreting physician, but time and equipment are needed to scan analog films. In comparison, double reading will impact the workforce by increasing the workload for interpreting physicians to a much greater degree.

CAD and Double Reading Combined

Two studies have evaluated markings by CAD on films read as negative by two independent radiologists. Sensitivity increased by 7 percent with CAD and 10 percent with double reading when the two approaches were compared (Karssemeijer et al., 2003). Both studies of CAD and double reading involved test sets and demonstrated that some missed lesions were marked by CAD, but the overall impact on practice was not evaluated (Destounis et al., 2004). Neither CAD nor double reading is addressed by current MQSA regulations.

THE IMPACT OF RESIDENCY/FELLOWSHIP TRAINING AND CME ON INTERPRETIVE SKILLS

The effectiveness of screening mammography greatly depends on the skills of the personnel interpreting the images. A portion of MQSA, consequently, addresses ways to ensure that physicians interpreting mammograms are adequately trained. Regulations stipulate that physicians must have received as initial training a minimum of 60 hours of documented medical education in mammography, and have interpreted at least 240 mammograms under the direct supervision of an interpreting physician. Board certification or 3 months of training in mammography is also required (21 C.F.R. § 900.12, Quality Standards). Regulations do not require that the interpreting physician be a radiologist, but most are. There are no data to assess whether variations in interpretive performance exist due to medical specialty.

Some studies suggest that residency training in screening mammography is insufficient for adequate interpretation of mammograms when radiologists begin their first postresidency jobs. One study found that the perceptive and cognitive skills of radiology residents in interpreting a selected set of mammograms were equivalent to those of mammography technologists, and significantly lower than those of experienced, practicing radiologists (Nodine et al., 1999). Another study found that residents’ sensitivity in detecting cancer in screening mammograms was less than half that of general radiologists (Newstead et al., 2003).

Although both these studies were small, they suggest that residency training alone does not adequately ensure accurate interpretation of mammograms. Presumably, continuing experience in interpreting mammograms in clinical practice improves lesion recognition and analysis skills. But improved performance could also be fostered by appropriate CME programs designed to meet gaps in knowledge or skills. Such programs are also vital for physicians to keep abreast of the rapid advances in biomedical knowledge and evidence-based medicine that suggest needed changes in how they perform or interpret mammograms.

CME is offered by a number of institutions nationwide, including academic organizations and medical device manufacturing and pharmaceutical companies. CME is a time-based system that awards credits when health professionals attend educational conferences, workshops, or lectures relevant to medical practice. A certain number of CME credits are often required to receive medical re-licensure, hospital privileges, specialty recertification, and professional society membership (Bennett et al., 2000). MQSA requires all physicians who interpret mammograms to teach or complete at least 15 CME hours in mammography every 3 years. It also stipulates that physicians must have an additional 8 hours of training in the use of any new mammographic modality (e.g., digital mammography) for which they have not previously been trained (21 C.F.R. § 900.12, Quality Standards).

Numerous studies have shown that, in general, CME programs enhance the performance of physicians. A synthesis of 99 studies found that most (70 percent) CME programs fostered positive changes in professional practice (Davis et al., 1995). Another research synthesis of three studies found CME programs improved the knowledge, skills, attitudes, and behavior of health professionals, and also improved patient health outcomes (Robertson et al., 2003).

How effective CME programs are at improving physicians’ performance depends on how they are structured. Recent studies reveal that programs that offer an opportunity for attendees to interact with their educators and practice the skills learned are more effective than traditional didactic lectures. Such opportunities for interaction include case discussions, role playing, and hands-on practice sessions. One review of 10 studies of CME interactive programs (in fields other than mammography) found that attendance at 7 of these programs was linked to statistically significant improvements in professional medical practice and/or health care outcomes. In contrast, the same review examined seven studies of traditional lecture-centered CME presentations and found that only one led to statistically significant improvement in medical practice and/or health care outcomes following attendance (Thomson-O’Brien et al., 2004). Although this suggests interactive programs are more effective than didactic ones, the researchers pointed out they found only one study that directly compared a didactic presentation with an interactive workshop. This study had inconclusive results.

Another research review (again, not involving mammography) concluded that “continuing education that is ongoing, interactive, contextually relevant and based on needs assessment is more likely to improve knowledge, skills, attitudes, behavior, and patient health outcomes” (Robertson et al., 2003). The importance of physicians recognizing the need to change their behavior, knowledge base, or skills was underscored by a study that found physician performance improved when learning experiences incorporated tests of knowledge and assessments of clinical practice needs (Davis et al., 1992). Another non-mammography-related review found needs assessment positively affected physician performance in four of five studies (Davis et al., 1999).

Other non-mammography-related studies suggest additional factors may be needed to supplement CME programs. These factors include practice-enabling strategies, such as patient education materials and office facilitators, and reinforcing methods such as feedback and physician reminders to support physicians’ ability to change an aspect of their practice (Davis et al., 1995). There is also some evidence for the theory that the peer group plays an important role in fostering or impeding the adoption of new information. This suggests that having all or most physicians at an institution attend the same CME program might create a “critical mass” of trainees to support new approaches (Davis et al., 1992, 1995; Robertson et al., 2003).

More specifically relevant to mammography, a comprehensive mammography audit of 12 radiologists in a group practice (performed before MQSA was enacted) revealed that following attendance at a 3- or 4-day mammography CME course, the radiologists detected 40 percent more cancers, a statistically significant increase, with only a 6.5 percent increase in caseload (Linver et al., 1992). However, this 1992 study primarily involved radiologists who had never before had mammography CME. Insofar as all practicing interpreting physicians have been required by MQSA regulations to obtain substantial amounts of CME since 1994, these results do not address the ability of mammography CME to provide incremental improvements in performance for interpreting physicians who have already had considerable CME experience.

The only available study involving relatively current CME experience shows a more modest improvement in radiologist performance. This study involved only 23 practicing general radiologists who attended a one-day CME lecture course on using the BI-RADS interpretation system (Berg et al., 2002). When given a selected set of mammograms before and after taking the course, the radiologists showed modest improvements in their analysis of lesion features, final assessment of cases, and recommendations to biopsy those lesions that proved to be malignant.

Thus, although the data on the effect of CME in general suggest effectiveness in improving performance, there is a paucity of data suggesting clinically relevant effectiveness of mammography CME in the current U.S. environment, in which MQSA regulations already require a large amount of CME.

In summary, the existing literature is insufficient to demonstrate either the effectiveness or lack of effectiveness of specific approaches to resident/fellowship training or specific CME course content in improving mammography interpretive skills. Thus, the Committee recommends that the value of CME specifically dedicated to mammography interpretive skills be demonstrated before an MQSA-mandated requirement for it is established. Funding should be provided for comprehensive research studies on the impact of various existing and innovative teaching interventions on mammography interpretive skills.

THE INFLUENCE OF SKILLS ASSESSMENT AND FEEDBACK ON PERFORMANCE

Overview

Theoretically, assessment via medical audits is designed to link practice patterns to patient outcomes in a way that can influence provider behavior and performance. No studies have been done to determine whether mammography outcomes monitoring alone is effective, but in other areas of medicine, there are conflicting reports in the literature about the effectiveness of audits (Weiss and Wagner, 2000). However, a review of studies on the audit and feedback approach found it can be effective in improving professional practice, particularly when baseline adherence to recommended practice is low (Jamtvedt et al., 2003). Another systematic review found audit with feedback was more consistently effective when feedback was delivered in the form of chart review (Davis et al., 1995). A benchmarking approach, in which physicians can compare their personal performance with that of top performers in a peer group or assess if their practice conforms to accepted practice guidelines, also improved the effectiveness of physician performance in ambulatory care (Kiefe et al., 2001).

The majority of research using medical audits for physician self-assessment has been done in primary care to understand resource use and management of medical conditions (Cave, 1995; Roblin, 1996; Spoeri and Ullman, 1997; Greenfield et al., 2002). Ross et al. (2000) found that physician audits for specific diagnosis-related groups resulted in significant reductions in hospital lengths of stay. The Ambulatory Care Medical Audit Demonstration Project (Palmer and Hargraves, 1996), the largest formal study of the use of audit information, was a randomized controlled trial to test the impact of peer-comparison audits along with other intervention strategies (including assistance with processing audit reports). Audits had a significant impact on the quality of care in monitoring hematocrit in anemic patients; performance of annual Pap test and clinical breast examination; follow-up of serum glucose in diabetics; and monitoring of patients on digoxin (Palmer and Hargraves, 1996). Medical audits and clinical prompts seem to be most effective when introduced at the point of patient care (Palmer and Hargraves, 1996). Weiner and colleagues (1995), using Medicare claims data to profile the care provided to diabetics, showed that even with adjustment for case mix of patients and characteristics of physicians, as many as 84 percent of patients were not receiving recommended care, such as hemoglobin A1c monitoring.

Adjustment for case mix of patients and characteristics of physicians is very important when profiling physician performance (Greenfield et al., 2002). For mammography, performance data on individual interpreting physicians may be misleading without adequate consideration of patient and physician characteristics (Elmore et al., 2002). Such adjustments reduce the noted variability in radiologist performance in mammography by approximately one-half (Elmore et al., 2002), as illustrated in Figure 2–3.

FIGURE 2–3 Results of statistical modeling for unadjusted (Line A) and adjusted (Line B for patient characteristics, C for radiologist characteristics, and D for both patient and radiologist characteristics) false-positive rates for 24 radiologists in a community setting. The variability in false-positive rates decreases with such adjustments.

SOURCE: Reprinted from Elmore et al. (2002) by permission of Oxford University Press and the Journal of the National Cancer Institute.

Therefore, physician profiling to assess performance can lead to incorrect and possibly dangerous conclusions unless careful attention is paid to adjustments for differences in patient characteristics (e.g., age and breast density). Because adjusting for patient characteristics is so important, and because physicians see only a small number of cancer cases each year, there is a need to develop appropriate statistical models and interfaces for use by clinicians in practice.

Landon et al. (2003) discuss numerous obstacles to the implementation of performance assessment programs, and propose standards for enhanced evaluation. They suggest ideal performance measures be established and standardized, evidence based, feasible to collect, representative of the activities of the specialty, adjusted for confounding patient factors, and applicable to an adequate sample size of patients to facilitate valid analysis. Unfortunately, evidence-based measures do not exist for each specialty, and it may not be possible to use a similar assessment program for each field. In addition, the widespread data collection necessary for adequate programs is costly, and current infrastructure is not capable of supporting it (Landon et al., 2003).

Examples of Mammography Audit and Quality Improvement Programs

Several other countries have rigorous quality assessment and improvement programs as part of their national breast cancer screening programs. These all involve centralized large-scale screening programs, so the effectiveness of such approaches has not been tested in the community practice setting in which most mammography is provided in the United States. Nonetheless, a review of these programs might prove instructive.

The United Kingdom’s National Health Service Breast Screening Program (NHSBSP) sets highly specific national quality assurance standards for mammogram interpretation, and regularly monitors adherence to its standards by a quality assurance network. This network includes regional professional quality assurance coordinators who meet regularly with radiologists in their region to review performance and outcomes of mammography screening, to share good practice, and to encourage continued improvements (IOM, 2005). Radiologists are required to rotate through screening and diagnostic clinics and participate in all activities of the breast care team, including multidisciplinary meetings (National Radiographers Quality Assurance Coordinating Group, 2000).

A supportive environment has been essential for the successful quality improvement of this program (Perry, 2003). Peer review and self-audit foster an environment of learning, rather than blaming, which is thought to be a key strength of the NHSBSP (Perry, 2004a). Annual audit results for the program are in the public domain and are disclosed to patients; individual performance results are provided only to the radiologist (Applied Vision Research Institute, 2004). In the unusual event that an individual or unit fails to meet these standards, a series of remedial actions is undertaken (Perry, 2003). Interpreting physicians, who undergo specific training in order to participate in the NHSBSP, are reported to be satisfied with these monitoring and review processes.

Australia’s national mammography screening program is also known for its uniform interpretation standards and rigorous monitoring to ensure its physicians comply with those standards. Participating radiologists are required to read 2,000 screening mammograms a year, and their performance is measured against a set of standards for cancer detection, early-stage cancer detection, and recall rates. Each mammography facility is required to inform radiologists annually of their performance compared to the minimum set standards, as well as compared to their peers. Mammogram readers are also given at least quarterly feedback on cancers that they did not recall (Kossoff et al., 2003).

Each Australian mammography facility has a designated radiologist who, if the detection rate for an individual radiologist falls below the 95 percent confidence interval of the benchmark rate, must implement a strategy to improve that individual’s performance. This strategy targets the specific weaknesses revealed by a detailed evaluation of the radiologist’s performance, and may require the radiologist to attend additional training courses or assessment clinics, or to have his or her interpretations regularly reviewed by a more experienced reader. Alternatively, a change in work practices may be instituted, such as switching to more suitable work times or limiting the number of films read per session. Underperforming radiologists are required to adhere to the plan for improvement, and their performance is monitored closely for 2 years after starting the plan (Kossoff et al., 2003).
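A trigger of this kind can be sketched in a few lines. The text does not say how the Australian program constructs its confidence interval, so the example below assumes a normal-approximation binomial interval around a benchmark rate, scaled to the reader's own volume. The benchmark of 5 cancers per 1,000 screens and the function names are hypothetical; only the 2,000-screen annual volume comes from the text.

```python
import math


def lower_bound_95(benchmark_rate: float, n_screens: int) -> float:
    """Lower limit of an approximate 95% interval around the benchmark
    rate for a reader who interpreted `n_screens` mammograms
    (normal approximation to the binomial; an assumption)."""
    se = math.sqrt(benchmark_rate * (1 - benchmark_rate) / n_screens)
    return max(0.0, benchmark_rate - 1.96 * se)


def needs_improvement_plan(cancers_detected: int, n_screens: int,
                           benchmark_rate: float) -> bool:
    """True if the observed detection rate falls below the interval."""
    observed = cancers_detected / n_screens
    return observed < lower_bound_95(benchmark_rate, n_screens)


# Hypothetical benchmark: 5 cancers detected per 1,000 screens.
BENCHMARK = 0.005
for detected in (3, 4, 10):
    flagged = needs_improvement_plan(detected, 2000, BENCHMARK)
    print(f"{detected} cancers in 2,000 screens -> flagged: {flagged}")
```

With counts this small, the normal approximation is rough. An exact binomial or Poisson interval would be a more defensible choice in a real audit.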

The Netherlands’ national breast cancer screening program also relies on a national system for quality control and monitoring. This system collects data on true-positive rate, false-positive rate, positive predictive value, and cancer detection rate, and evaluates the data in aggregated form for every central screening facility that interprets mammograms. In addition, a small contingent of experts from the National Training and Expert Center for Breast Cancer Screening conducts onsite audits of every interpretation facility once every 2 to 3 years. These audits involve reviewing interval and screen-detected cancer cases to realistically assess the false-negative rates of the facility. A report prepared after the audit includes suggestions for possible improvements. When necessary, a screening facility is prompted to make improvements. Gradual improvements in screening parameters were noted in second audits of facilities compared to first audits, including higher detection rates and lower false-negative rates (van der Horst et al., 2003). Mammography screening programs in Sweden and in several Canadian provinces also have high performance standards (IOM, 2005).
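The four measures named above can be derived from a facility's screening outcomes once each examination is classified against a cancer reference standard. The sketch below uses the conventional 2x2 definitions, which may differ in detail from those used by the Dutch program, and the counts are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class AuditCounts:
    """Screening outcomes for one facility (hypothetical structure)."""
    true_pos: int    # recalled, cancer confirmed
    false_pos: int   # recalled, no cancer
    false_neg: int   # not recalled, cancer found later (e.g., interval cancer)
    true_neg: int    # not recalled, no cancer


def audit_measures(c: AuditCounts) -> dict:
    """Compute the standard audit measures from the four counts."""
    total = c.true_pos + c.false_pos + c.false_neg + c.true_neg
    return {
        "true_positive_rate": c.true_pos / (c.true_pos + c.false_neg),
        "false_positive_rate": c.false_pos / (c.false_pos + c.true_neg),
        "positive_predictive_value": c.true_pos / (c.true_pos + c.false_pos),
        "cancer_detection_rate_per_1000": 1000 * c.true_pos / total,
    }


# Invented counts for 10,000 screening examinations.
example = AuditCounts(true_pos=40, false_pos=760, false_neg=10, true_neg=9190)
for name, value in audit_measures(example).items():
    print(f"{name}: {value:.4f}")
```

The false-negative count depends on finding cancers that screening missed, which is why the onsite audits review interval cancers as well as screen-detected ones.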

In some circumstances, such quality improvement programs can be implemented in the United States as well. For example, the interpretation of mammograms improved following the institution of an extensive quality improvement program at Kaiser Permanente (KP) Colorado. Begun in 1998, this program created a centralized facility for reading mammograms, in which radiologists had access to specialized training, were required to participate in self-assessments three times a year, and had to read a high volume of mammograms (Adcock, 2004). The performance of individual radiologists was continually monitored with a number of measures, including proportion of early-stage cancers detected, sensitivity, recall rate for screening mammograms, positive predictive value, and diagnosis of new, probably benign lesions. These measures were derived from the KP Colorado Tumor Registry, from reports of mammogram results, and from Radiology Information System extracts. The measures were compared to published benchmarks, and radiologists received feedback on their results. Performance gaps were analyzed and targeted with specific interventions, such as securing second opinions from another radiologist until performance improved (Adcock, 2004). Within a few years after instituting this quality assurance program, there was a statistically significant increase in the sensitivity for cancer detection and in the proportion of early-stage cancers detected, without an increase in the number of recalls. Improvements in efficiency have also produced substantial cost savings; for example, by 2004, the professional component of the cost per mammogram interpreted by the group had declined to $28 (77 percent of the Medicare benchmark). In a recent survey of radiologist satisfaction (which did not distinguish between mammography subspecialists and other radiologists), 15 of 16 anonymous respondents agreed that if they had the opportunity to revisit their choice to join the group, they would do so again.

CHALLENGES TO USING MEDICAL AUDIT DATA TO IMPROVE INTERPRETIVE PERFORMANCE IN THE UNITED STATES

A Lack of Data and Information to Guide Audit and Feedback

U.S. radiologists appear not to be aware of their own interpretive performance levels, but they need to know their current levels of accuracy, and understand what those levels mean, before they can determine whether and how to improve. Both patients and clinicians are often confronted with a baffling array of percentages and probabilities related to mammography. Research indicates that some individuals, including practicing clinicians and highly educated participants, experience difficulty in understanding rates, risks, and proportions (Gigerenzer, 2002). This is illustrated in Figure 2–4, which shows that more than 90 percent of radiologists overestimated a woman’s 5-year risk of breast cancer based on a patient vignette. The use of frequencies (rather than probabilities), visual aids, and individualizing the data may improve clarity and understanding about mammography performance.

FIGURE 2–4 Radiologists’ perceived 5-year risk of breast cancer for a vignette of a 41-year-old woman whose mother had breast cancer, who had one prior breast biopsy with atypical hyperplasia, and who was age 40 at first live birth.

SOURCE: Reprinted with permission from Egger et al. (in press).

However, as noted previously, the estimates of accuracy at the level of individual interpreting physicians are subject to variation and may not adequately describe interpretive performance for individual physicians. For example, sensitivity calculations may not be reliable for physicians interpreting only 960 films every 2 years, because in a given year, they may see few or even no films of women with breast cancer. This problem could potentially be overcome by using well-characterized test sets for self-assessment, although data from “test” situations may be unreliable (Rutter and Taplin, 2000) because the conditions under which such testing takes place may be quite different from those used in conventional practice. For example, a nationwide test would likely need to be implemented using digital images because of the difficulty and expense of circulating films, but the majority of mammograms in the United States are done on film, so the viewing conditions for the test may be inconsistent and different from common practice. Also, test sets are heavily weighted with cancers compared to the usual screening population, leading to higher than normal recall rates.
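The small-numbers problem can be made concrete with a back-of-the-envelope calculation. The sketch below takes the 960-films-per-2-years minimum from the text and assumes, purely for illustration, that about 3.5 of every 1,000 films come from a woman with breast cancer (close to the detection rates quoted earlier in the chapter, though the true cancer rate in any reader's caseload will differ). It also assumes each film is an independent draw, which is a simplification.

```python
def expected_cancers(n_films: int, cancer_rate: float) -> float:
    """Expected number of cancer cases among `n_films` films."""
    return n_films * cancer_rate


def prob_no_cancers(n_films: int, cancer_rate: float) -> float:
    """Probability of seeing zero cancers, treating films as
    independent Bernoulli trials (a simplifying assumption)."""
    return (1 - cancer_rate) ** n_films


FILMS_PER_YEAR = 960 // 2          # MQSA minimum spread evenly over 2 years
ASSUMED_RATE = 3.5 / 1000          # illustrative assumption, not a source figure

print(f"expected cancers per year: "
      f"{expected_cancers(FILMS_PER_YEAR, ASSUMED_RATE):.2f}")
print(f"chance of seeing none in a year: "
      f"{prob_no_cancers(FILMS_PER_YEAR, ASSUMED_RATE):.1%}")
```

Under these assumptions a minimum-volume reader would expect fewer than two cancers a year and would see none in a meaningful fraction of years, so an annual sensitivity estimate would rest on a handful of cases at best.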

A few self-assessment programs already exist for mammography, but are not widely used in the United States (Box 2-1). The ACR has several self-assessment programs called Mammography Interpretive Skills Assessment (MISA) (Sickles, 2003). There is no requirement for interpreting physicians to use this assessment program, and most do not. This is in stark contrast to the 90 percent of British interpreting physicians who use a similar mammography self-assessment program called PERFORMS. The Screening Mammography Program of British Columbia uses an interpretive skills test as an acceptance test for screener candidates (Warren-Burhenne, 2003).

The Committee does not recommend mandatory proficiency testing via self-assessment exams for all interpreting physicians at present because the available testing procedures have not been rigorously evaluated and proven to have a direct positive impact on interpretive performance in clinical practice. The ACR’s MISA exam has been evolving over the past 12 years and was not designed to be sufficiently rigorous to permit valid assumptions or inferences regarding actual performance of an individual examinee in clinical practice (Sickles, 2003). Certain steps in test validation that would be required for legal defensibility if licensing were intended as the primary purpose of the examination have not been undertaken (Sickles, 2003). Furthermore, the MISA test sets, which currently include fewer than 30 cases, most of which contain cancers, would need to be greatly expanded and frequently refreshed if they were to be widely used for proficiency testing. Although these are not insurmountable obstacles, the time and costs associated with further development and validation would be substantial. Thus, the Committee recommends that pilot projects be undertaken within breast imaging Centers of Excellence (described in the section entitled “Breast Imaging Centers of Excellence”) to test the value and feasibility of proficiency testing.

BOX 2–1
Mammography Self-Assessment Programs

The American College of Radiology (ACR), the United Kingdom’s National Health Service Breast Screening Programme (NHSBSP), and the Screening Mammography Program of British Columbia (SMP-BC) have developed self-assessment programs for the interpretation of mammograms. The ACR’s Mammography Interpretive Skills Assessment (MISA) offers two CD-ROMs that show the mammograms of 28 and 29 cases, respectively. The radiographic images can be magnified and panned on the computer screen, and the location of abnormal findings on each displayed image can be identified with a mouse click. The images are accompanied by multiple-choice questions that test important aspects of breast imaging practice. The programs also provide instant feedback and text explanations for correct and incorrect responses.

The Personal Performance in Mammographic Screening (PERFORMS) program put out by the NHSBSP is more comprehensive. Each year this program offers 2 film sets, each with 60 two-view cases. The feedback given to radiologists who evaluate the films is extensive; the radiologists are not only informed of how many malignant cases they missed, but whether those missed cases showed any patterns, such as the presence of dense tissue or multiple microcalcifications. The program also provides details concerning the cases a radiologist incorrectly recalled for further assessment (false positives).

In addition, particular film sets also allow the individual to see a large number of examples of one particular abnormality, and have been shown to improve radiologists’ detection of these specific features. Additional advanced training sets enable radiologists to concentrate on the types of cases they are most likely to misinterpret. The participating radiologists are also informed of how they performed in comparison with their anonymous colleagues. An individual’s results are anonymous and are made available only to the radiologist who takes the test.

The SMP-BC test set includes about 100 cases, of which one-third to one-half contain malignant or premalignant lesions of varying conspicuousness. The sensitivity and specificity of each reader is calculated in a case-based and breast-based manner, and acceptance as a screener depends on performance. A minimum threshold is set for both sensitivity and specificity, and all obvious cancers must be identified. For active screeners who do not meet the minimal criteria, additional training is provided and double reading with an approved radiologist is required until the test is retaken and passed.

SOURCES: National Health Service (2003); Gale (2003); Wooding (2003); Sickles (2003).
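The case-based scoring described for the SMP-BC test set can be sketched as follows. The threshold values and the data layout are hypothetical: the box does not state the actual pass marks, and the rule that all obvious cancers must be identified is modeled here with an assumed per-case flag.

```python
from dataclasses import dataclass


@dataclass
class ReadCase:
    has_cancer: bool       # ground truth for the case
    obvious: bool          # hypothetical flag marking an "obvious" cancer
    reader_recalled: bool  # reader called the case abnormal


def score_reader(cases, min_sensitivity=0.80, min_specificity=0.80):
    """Return (sensitivity, specificity, passed) for one reader.

    The 0.80 thresholds are placeholders, not the program's real criteria.
    """
    cancers = [c for c in cases if c.has_cancer]
    normals = [c for c in cases if not c.has_cancer]
    if not cancers or not normals:
        raise ValueError("test set needs both cancer and non-cancer cases")

    sensitivity = sum(c.reader_recalled for c in cancers) / len(cancers)
    specificity = sum(not c.reader_recalled for c in normals) / len(normals)
    missed_obvious = any(c.obvious and not c.reader_recalled for c in cancers)

    passed = (sensitivity >= min_sensitivity
              and specificity >= min_specificity
              and not missed_obvious)
    return sensitivity, specificity, passed


# Invented 100-case set with 40 cancers (within the one-third to one-half range).
cases = (
    [ReadCase(True, True, True)] * 10        # obvious cancers, all detected
    + [ReadCase(True, False, True)] * 25     # subtle cancers detected
    + [ReadCase(True, False, False)] * 5     # subtle cancers missed
    + [ReadCase(False, False, False)] * 54   # non-cancer cases correctly passed
    + [ReadCase(False, False, True)] * 6     # non-cancer cases recalled
)
sensitivity, specificity, passed = score_reader(cases)
```

In this invented set the reader detects 35 of 40 cancers and correctly passes 54 of 60 non-cancer cases. Missing even one case flagged as obvious would fail the reader regardless of the two rates.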

Lack of a Centralized Source of Information for Outcomes Data

The vastly decentralized system of health care in the United States is a major reason for the great variability in the methods of collecting and using mammography data that are described in the previous sections. As noted above, mammography programs in Canada, the United Kingdom, and other European countries have the benefit of national or regional surveillance data systems in which a centralized data repository is used to capture accurate data and feed them back to radiologists and facilities. These systems allow for the calculation of medical outcome audit measures by a centralized entity, thereby limiting the variability in data definitions and calculation methods that complicate efforts in the United States. Such surveillance systems also provide more complete case ascertainment for calculation of measures such as sensitivity. In the United States, sensitivity and specificity data cannot be collected unless periodic linkage to data in a regional tumor registry can occur, precluding use by the vast majority of mammography facilities.

There are a few large surveillance systems in the United States that demonstrate the feasibility of medical outcomes monitoring. The National Breast and Cervical Cancer Early Detection Program (NBCCEDP), sponsored by the Centers for Disease Control and Prevention, maintains a large data system for tracking breast and cervical cancer screening provided for uninsured women across the country (Henson et al., 1996). Although this has been used to conduct quality assurance activities in New York state, not all mammography facilities participate in the NBCCEDP, and the underserved population screened through the program has important demographic differences and prior screening history that may limit its generalizability (Hutton et al., 2004).

The NCI-sponsored Breast Cancer Surveillance Consortium (BCSC) has successfully linked screening mammogram data with population-based cancer registries in seven regional areas of the United States: North Carolina, Colorado, Seattle (Group Health Cooperative), New Hampshire, New Mexico, San Francisco, and Vermont. Established in 1994 to study breast cancer screening in the United States, the consortium’s database contains information on millions of screening mammograms. Each mammography registry sends its data electronically to a centralized Statistical Coordinating Center for pooled analyses, and is linked to cancer registries to enable the determination of predictive value, sensitivity, and specificity of mammography, as well as practice patterns (Ballard-Barbash et al., 1997). Although the BCSC could provide benchmark data useful for an audit of a mammography facility (Ballard-Barbash et al., 1997), its mechanisms for encrypting data preclude the ability to identify individual performance of interpreting physicians or facilities, and its focus on the facilities precludes its use as a national repository in its current format. Although the individual registries that contribute data to the BCSC do collect radiologist-specific and facility-specific data that could be used for quality improvement, the BCSC was not intended to be used for quality assurance purposes and facilities may be less likely to participate in the BCSC if data are used for purposes other than research. Each registry undertakes audits for participating facilities. Several medicolegal protections have been employed to prevent forced disclosure or uses of BCSC site data for medicolegal purposes, as outlined by Carney et al. (2000).

In contrast to the BCSC, which is primarily research oriented, the National Consortium of Breast Centers, Inc. is planning to collect data for the specific purpose of quality improvement (National Consortium of Breast Centers, Inc., 2004). The group is developing a set of core measures to define, improve, and sustain quality standards in comprehensive breast programs and centers. Participation is voluntary, and centers can contribute data using a unique identifier code to a large database on a dedicated server. The initial goal is to identify benchmarks for services and procedures within breast centers through standardized data input and statistical review and analysis. To start with, the project will only inform a participating center how it compares to the benchmark. Each center will then self-evaluate and develop improvement plans.³

³ Personal communication, D. Wiggins, National Consortium of Breast Centers, Inc., January 17, 2005.

Examples from other areas of medicine in the United States also could be informative. Regular quality assessments and appropriate feedback following audits are thought to underlie the improvements seen in the outcomes of surgeries conducted at Veterans Administration (VA) hospitals. Since 1994, the VA has operated a National Surgical Quality Improvement Program (NSQIP) in which all medical centers performing major surgery participate. Data on postoperative mortality and morbidity are collected at each of these facilities and analyzed at two centers. The expected outcomes are determined based on the risk factors of the population treated, and the hospital is rated according to how closely it matches those expected outcomes. These outcomes are reported to the facilities each year. Hospitals with worse than expected outcomes are provided with self-assessment tools and site visits to help them identify and address deficiencies in the quality of care they deliver. In addition, the NSQIP disseminates, through its annual reports distributed to participating hospitals, the good practices thought to underlie the better than expected outcomes of some of its facilities. Since the NSQIP began, the 30-day postoperative mortality after major surgery in the VA has decreased by 27 percent and the 30-day morbidity by 45 percent (Khuri et al., 2002). However, the extent to which mammography facilities can adopt strategies used in surgical studies is unclear because the disciplines are so vastly different.
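The comparison of observed with expected outcomes can be illustrated with an observed-to-expected (O/E) ratio. The sketch below is a simplified illustration with made-up per-patient risks; it assumes each patient's predicted probability of death comes from some risk-adjustment model and does not reproduce the NSQIP's actual models.

```python
def observed_to_expected(outcomes, predicted_risks):
    """Ratio of observed deaths to the sum of model-predicted risks.

    outcomes: iterable of 0/1 values (1 = death within 30 days)
    predicted_risks: iterable of probabilities from a risk model (assumed given)
    """
    outcomes = list(outcomes)
    predicted_risks = list(predicted_risks)
    if len(outcomes) != len(predicted_risks):
        raise ValueError("one predicted risk is needed per patient")
    expected = sum(predicted_risks)
    if expected == 0:
        raise ValueError("expected count is zero; ratio undefined")
    return sum(outcomes) / expected


# Hypothetical hospital: 200 patients, each with a 5 percent predicted risk.
risks = [0.05] * 200            # expected deaths: about 10
deaths = [1] * 14 + [0] * 186   # observed deaths: 14
ratio = observed_to_expected(deaths, risks)
```

A ratio above 1 indicates more adverse outcomes than the case mix would predict, and a ratio below 1 indicates fewer. The hypothetical hospital here has a ratio of about 1.4.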

Part of the feasibility and success of the NSQIP is due to its centralized authority and advanced medical informatics infrastructure, which enabled it to develop national averages and risk adjustment models, and set up a model system for the comparative assessment of quality surgical care among its hospitals (Khuri et al., 2002). But a pilot study that used the same methods followed by NSQIP to provide quality improvement to the surgeries performed by three academic hospitals found it to be a feasible and valid system that is applicable to non-VA medical centers (Waynant et al., 1999; Khuri et al., 2002).

Although the above examples demonstrate the feasibility of collecting and analyzing large amounts of medical data from several disparate areas of the country, and maintaining patient, provider, and facility confidentiality while collecting and using electronic records, the advent of electronic records that enable the sharing of large datasets has been accompanied by increasing concern about protections for confidential medical information (Carney et al., 2000). Inappropriate access of such information could enable it to be exploited for marketing or insurance purposes, or could damage the reputation of patients, providers, or facilities and lead to malpractice lawsuits and loss of income (Carney et al., 2000). Thus, protecting audit data from discoverability is important to ensure accurate reporting and widespread participation in a self-assessment program designed to improve quality.

To maintain confidentiality of the medical information it uses, the NSQIP relies on a federal statute governing veterans’ benefits (Title 38) that contains provisions specifying the protection of confidential medical information used in quality management activities of the VA (Veterans Health Administration, 2004). There are a number of other federal and state statutes designed to protect the confidentiality of medical information used in quality assessments. The scope of protection offered by state statutes relevant to quality assessments varies widely and depends on how the information is handled and by whom. Furthermore, these statutes cannot be relied on to protect confidentiality when data are collected from more than one state and transferred across state lines (Carney et al., 2000).

The best protection of research data is offered by Federal Certificates of Confidentiality, which are granted to federally funded research projects or institutions, or for research of a sensitive nature, such as research on sexuality or the use of recreational drugs. The privacy protection afforded by these certificates extends not just to patients, but also to health care providers (Carney et al., 2000). The recently adopted federal Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule offers extensive safeguards to ensure the confidentiality of medical information. But it exempts from its stringent requirements the disclosure of protected health information that is used for “conducting quality assurance and quality improvement, including outcomes evaluation and development of clinical guidelines,” provided that the work is mainly intended to improve the operations of a specific organization rather than for research (FDA, 2004; Gunn et al., 2004). Thus, such an undertaking for quality improvement in mammography should be feasible.

Regardless of the statutory requirements already governing patient confidentiality, a national database for collecting and analyzing data to improve mammogram interpretation must use a number of safeguards to ensure confidentiality of the medical information it collects. These safeguards include preventing inappropriate access to electronic data with passwords, firewalls, and data encrypting techniques, as well as paper shredding and other proper disposal of sensitive printed information that is no longer needed (Carney et al., 2000).

Facility-Based Challenges

Several facility characteristics have an important impact on statistical measures of performance. Although there are accepted ranges for many mammography performance measures, the lack of a current system to control for variation of facility characteristics greatly limits the utility of performance measures and confounds facilities’ ability to compare their performance to accepted benchmarks. The following is a summary of several relevant facility characteristics that affect performance measures.

Practice Type and Setting

Based on the ACR’s 2003 Survey of Practices, 18 percent of radiologists in academic practice reported interpreting mammograms, but 73 percent of radiologists in multispecialty or private radiology facilities interpret mammograms. Furthermore, 55 percent of radiologists who reported that their practice setting was a hospital interpreted mammograms as compared with 68 percent of radiologists who reported working only in a non-hospital setting.

Hospital-based and medical center programs typically have affiliated departments of surgery, pathology, and oncology within the institution that can perform all necessary diagnostic testing and treatment. Patients who receive mammograms at facilities that are affiliated with a hospital are often diagnosed and treated within the same institution. As a result, efforts to track positive mammograms may require less time and be more complete in hospital-based mammography facilities than in those that are not hospital based. Hospital-based and medical center practices may use computerized radiology information systems, simplifying access to data for audit analyses. Hospital-based mammography and/or academic practices are more likely to interpret mammograms for patients who are at greater risk for developing breast cancer than other practices in the community.

Mode of Interpretation

Another important factor that can influence the performance characteristics of mammography is a facility’s mode of film interpretation. There are two commonly used interpretation modes for screening mammography, online (interpretation while the woman is still at the facility) and batch (screening mammograms are grouped by facility personnel for later interpretation by an interpreting physician during a focused, uninterrupted period of time).

Batch mode interpretation of screening mammograms allows interpreting physicians to focus their attention and allows interpretation of screening films in a quieter environment with fewer distractions and less ambient light, factors that are important for the conspicuity of subtle lesions. Batch interpretation is also a more cost-efficient method of practice for radiology facilities (Feig et al., 2000). Online reading offers women same-day resolution of their mammogram because additional diagnostic imaging or tissue sampling can be performed immediately after the initial interpretation, but it is more costly for mammography facilities and also involves inefficient use of interpreting physician time (Raza et al., 2001). Online interpretation also disrupts the workflow of interpreting physicians and can even result in misinterpretations (Raza et al., 2001). The current levels of reimbursement for mammography and perceptions of a shortage of radiologists who interpret mammograms likely account for the more widespread use of batch than online interpretation of screening mammograms (Raza et al., 2001). In 2002, 84 percent of community-based mammography facilities were batch interpreting screening mammograms (Hendrick et al., 2005). Those facilities that choose to perform screening mammography in an online mode often base their decision on the potential marketing advantage over other competitors by providing same-day results to patients.

However, for patients to receive results on the same day may be an unrealistic expectation for screening mammography, especially given that there is wide acceptance by patients for a wait of a week or more for laboratory test results, including Pap cytology. Hulka et al. (1997) have shown that women may be more accepting of delayed interpretations if the mammograms yield higher sensitivity, and Raza et al. (2001) have shown that although two-thirds of women would prefer online interpretation accompanied by a 30- to 60-minute wait for results, only about 10 percent would be willing to pay for the extra cost of the service.

A facility’s mode of interpretation can affect its performance as measured by clinical outcomes, thereby limiting interfacility comparison of measures such as specificity (Raza et al., 2001). Interpreting physicians are more apt to request additional diagnostic imaging when a patient is waiting in a nearby exam room than if she needs to be recalled at a later date (Raza et al., 2001; Ghate et al., 2005). Thus, same-day interpretations may result in higher recall rates (additional diagnostic breast imaging) and potentially lower specificity than interpretations performed in batch mode. Note that if a screening mammogram is interpreted as abnormal and additional diagnostic imaging evaluation is performed on the same day, it can be converted to a diagnostic mammogram using a billing code modifier. However, for auditing purposes, such a case should be considered to be an abnormal screening examination (BI-RADS 0), and the subsequent diagnostic examination should be audited separately.

Mobile Mammography

Mobile mammography is thought to address many of the barriers that prevent women, especially underserved women, from obtaining regular exams (Sickles et al., 1986). However, the overhead costs are significant (Wolk, 1992).

Mobile mammography programs use one of two types of vehicles, a recreational vehicle/coach that has been customized to fit a mammography unit, processor, and exam rooms (mobile van), or a standard passenger van that has been customized to allow for the transportation of a mammography unit to various indoor locations (portable unit). The most recent national survey of mobile mammography facilities in 1995 found that 37 percent operated portable units and 63 percent operated mobile vans (Brown and Fintor, 1995).

Most mobile mammography programs process all of the day’s films as a batch at the end of the day at a fixed location (De Bruhl et al., 1996). This saves money by eliminating the need to purchase a separate, dedicated processor for the mobile unit and allows for nearly twice as many women to be screened because the radiologic technologist does not need to process each set of films between patients (Sickles et al., 1986). Batch processing allows for a more controlled environment for processing, but causes inconvenience for women whose films need to be retaken. Because the films are not processed until the van has returned to its sponsoring facility, women whose films display movement, inadequate compression, or other problems must be recalled to the facility to have images retaken (Monsees and Destouet, 1992). It can be especially challenging to obtain retakes and additional diagnostic workup for women who have geographical or cultural barriers that prevented them from coming to the facility for screening in the first place (Pisano et al., 1995).

Images that would otherwise result in a retake at a fixed-site screening facility are sometimes deemed abnormal in a mobile setting in order to increase patient adherence with a recommendation for additional imaging, resulting in a greater percentage of abnormal mammograms that turn out to be false positive (Sickles, 1995b). Film quality problems can, therefore, manifest themselves as either high retake rates or high recall rates.

Current Radiology Information Systems Used by Mammography Facilities

Following implementation of MQSA regulations, several computerized mammography management systems were developed to assist mammography facilities in collecting, organizing, and linking mammography information to assist with reporting requirements. These include but are not limited to:

Insight Radiology Information System
PenRad
Amber Diagnostics Radiology Management Systems
iCAD Radiology Information Management
OmniCare Mammography Management Systems
CPU Medical Management Systems
VitalWorks Radiology Information Systems
IDX-Rad
MRS

Examples of the services these systems provide include: patient registration and scheduling; dictation and transcription; multisite support, data access, management reporting (e.g., physician and patient letters), and mammography tracking; image routing and archive management; and staff competency tracking. The mammography tracking functions often collect and summarize data elements for the required audit. Though no studies have examined how these systems calculate performance data, anecdotal reports have indicated that significant variations and errors in calculation methods do occur⁴,⁵ and limit the ability to pool data generated from these systems using the data tables they generate. This could be corrected if market forces drove these businesses to reprogram their databases to collect the same data using the same definitions and calculate biopsy yield and other accuracy indices using exactly the same methods. If this occurred, it would be possible for such computerized facilities to pool data centrally. However, some mammography facilities still do not have computerized information systems, making it very difficult to collect, tabulate, and update audit data.

LIMITATIONS OF CURRENT MQSA AUDIT REQUIREMENTS

MQSA regulations require that facilities establish and maintain a mammography medical outcomes audit program (21 C.F.R. § 900.12(f)). The results of the audit are not collected by FDA. To meet the regulatory requirements for the medical outcomes audit program, facilities need only demonstrate during their annual inspection that:

All positive mammograms (BI-RADS 4 or 5 assessments) are entered into the system.
Biopsy results are present (or the facility attempts to get them).
There is a designated reviewing interpreting physician.
An analysis is done annually, done separately for each individual, and done for the facility as a whole.

When issuing the 1997 MQSA final regulations, FDA noted that the medical audit process was in its infancy and stated that “in the absence of any consensus standards for either mammography outcomes or data collection methods, FDA has chosen to defer proposing these parameters and methods until more research has been completed and clear guidelines can be formulated for mammography centers.” FDA further noted that “results and outcomes of the [Breast Cancer Surveillance] Consortium will help establish performance standards for mammography and FDA will evaluate these for appropriateness for future standards under MQSA” (FDA, 1997).

⁴ Personal communication, E. Sickles, M.D., Chief of Breast Imaging, University of California, San Francisco, School of Medicine, October 15, 2004.
⁵ Personal communication, B. Yankaskas, Ph.D., Department of Radiology, University of North Carolina, Chapel Hill, October 15, 2004.

Unfortunately, although the expressed purpose of the medical outcomes audit program in MQSA Final Regulations was “to ensure the reliability, clarity, and accuracy of the interpretation of mammograms” (21 C.F.R. § 900.12 (f)), in its current form it does little more than burden facilities with processes that generate data that are inadequate for ensuring or improving interpretive performance.

First, facilities are allowed to develop systems for tracking positives that work best for them. Tracking systems can be maintained on computer or in paper form. MQSA requires that facilities attempt to collect cancer versus no-cancer outcomes for women referred for consideration of biopsy (Category 4 or 5), but no actual measurement of performance is required by FDA. Furthermore, although facilities may also choose to track women for whom additional imaging is recommended (defined in BI-RADS as Category 0: Incomplete, needs additional imaging), few track this latter assessment category, even though most screened women who are referred for biopsy start out with a recommendation for additional imaging. As many as 1 to 8 percent of mammography assessments are Category 0 (Poplack et al., 2000; Taplin et al., 2002; Geller et al., 2002; Colorado Mammography Project, 2003; Kerlikowske et al., 2003). While encouraged to have ongoing tracking systems in place, facilities are not required to follow up on positives more than once per year. FDA inspectors check to see that reasonable efforts have been made to obtain the pathology results of positive mammograms, but most facilities perform passive surveillance by contacting referral facilities episodically.

Second, no specific statistics are required to be calculated and reviewed as part of the annual medical outcomes audit. In practice, most facilities calculate only one type of PPV, and the method of calculation of this measure varies widely. Facilities are not required to stratify their analyses by screening and diagnostic mammograms, and there is variation among facilities with respect to which mammograms are included in calculations. In addition, analyses must be facility specific, even though combining data for individual interpreting physicians who interpret for different facilities is more useful for assessing interpretive skill and less burdensome. Most important, the regulations require only that biopsy outcomes be reviewed. Although interpreting physicians and the facility are to be informed of the results of the audit, there is no further requirement for use of these data, such as for skills improvement. Facilities are not required to explore reasons for variability among interpreting physicians or undertake efforts to improve the quality of their performance. As a result, only facilities that are self-motivated to improve the quality of their mammograms undertake this feedback component, which appears to be the most important element for quality improvement.
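The effect of an unstandardized PPV definition can be shown with a small calculation. The sketch below uses invented counts and the commonly used audit definitions, in which PPV₁ is based on abnormal screening interpretations, PPV₂ on biopsy recommendations, and PPV₃ on biopsies performed. As a simplifying assumption, the same cancers are counted in all three numerators.

```python
def ppv(cancers: int, denominator: int) -> float:
    """Positive predictive value: cancers divided by the chosen group of positive exams."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return cancers / denominator


# Invented counts for one year of screening at a hypothetical facility.
abnormal_screens = 1000     # any abnormal screening interpretation
biopsy_recommended = 200    # final assessment of BI-RADS 4 or 5
biopsies_performed = 160    # biopsies actually carried out
cancers = 50                # simplification: every cancer is in all three groups

ppv_by_definition = {
    "PPV1 (abnormal screens)": ppv(cancers, abnormal_screens),
    "PPV2 (biopsy recommended)": ppv(cancers, biopsy_recommended),
    "PPV3 (biopsy performed)": ppv(cancers, biopsies_performed),
}

for name, value in ppv_by_definition.items():
    print(f"{name}: {value:.1%}")
```

With these counts the reported PPV ranges from 5 percent to about 31 percent depending only on the denominator, which is why pooled or benchmarked figures need a shared definition.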

Third, while a review of interval cancers has been useful for providing feedback to radiologists in other health care systems such as the United Kingdom (Perry, 2003) and British Columbia (Warren-Burhenne, 2003), MQSA-required review of false-negative cases of breast cancer that become known to a facility has little impact on performance. Facilities are not required to conduct active surveillance for false negatives and the fragmented nature of health care in the United States increases the likelihood that future mammograms and breast health care will occur in facilities other than the one that provided the initial mammogram. The potential medicolegal implications of finding cancers that were potentially misinterpreted serve as a strong deterrent to active efforts to identify false negatives for the purposes of quality assurance. Also, there are few centralized surveillance systems that link mammograms with subsequent diagnoses, thus allowing for more complete case ascertainment of false negatives; existing cancer registries are not a practical resource for this purpose.

STRATEGIES TO IMPROVE MEDICAL AUDIT OF MAMMOGRAPHY

The Committee concludes that current medical audit requirements under MQSA are inadequate for measuring or improving the quality of image interpretation. Thus, the Committee recommends that the basic MQSA-required medical audit of mammography interpretation be enhanced and standardized to require the calculation, for internal use within a given mammography facility, of the following core measures: PPV₂, cancer detection rate per 1,000 women, and abnormal interpretation rate (women whose mammogram interpretation led to additional imaging or biopsy). These three measures should be stratified by screening and diagnostic mammography. Readers working at multiple facilities should be able to combine their data from each facility to calculate one set of measures for their overall performance. In addition, the group of women with positive mammograms that facilities are required to track should include not only those with BI-RADS 4 or 5 assessments, but also all screened women for whom additional imaging is recommended (defined in BI-RADS as Category 0—needs additional imaging) until a final assessment is rendered, so that if biopsy is recommended (i.e., final assessment of BI-RADS 4 or 5), the appropriate examinations will be included in the calculations of PPV₂ and cancer detection rate. Implementing these enhanced and standardized audit procedures will provide facilities with consistent and meaningful measures of performance and will thus make it more feasible for audit interpreting physicians to develop performance improvement plans as needed. Facilities should receive additional reimbursement for undertaking these new audit activities (see section titled The Need for a Supportive Environment to Promote Quality Improvement).
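The three core measures can be illustrated with a minimal sketch. It is not a prescribed audit method: the record layout (`Exam`), the function names, and the sample records are hypothetical, and the sketch assumes that an abnormal interpretation is an initial BI-RADS 0, 4, or 5 assessment, that PPV₂ is the share of examinations with a final BI-RADS 4 or 5 assessment that result in a cancer diagnosis, and that the cancer detection rate counts those same cancers per 1,000 examinations. Measures are stratified by screening versus diagnostic examinations and pooled across facilities for each reader.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Exam:
    """One mammography examination (hypothetical record layout)."""
    facility: str
    reader: str
    exam_type: str        # "screening" or "diagnostic"
    initial_birads: int   # assessment at first read (0-5)
    final_birads: int     # assessment after any additional imaging (1-5)
    cancer: bool          # cancer diagnosed on follow-up


def core_measures(exams):
    """Compute the three core measures for one group of examinations."""
    n = len(exams)
    # Assumption: abnormal = interpretation leading to more imaging or biopsy.
    abnormal = sum(1 for e in exams if e.initial_birads in (0, 4, 5))
    # Biopsy recommended = final assessment of BI-RADS 4 or 5.
    biopsy_rec = [e for e in exams if e.final_birads in (4, 5)]
    cancers = sum(1 for e in biopsy_rec if e.cancer)
    return {
        "n": n,
        "abnormal_interpretation_rate": abnormal / n if n else None,
        "ppv2": cancers / len(biopsy_rec) if biopsy_rec else None,
        "cancer_detection_rate_per_1000": 1000 * cancers / n if n else None,
    }


def audit_by_reader(exams):
    """Pool each reader's exams across facilities, stratified by exam type."""
    groups = defaultdict(list)
    for e in exams:
        groups[(e.reader, e.exam_type)].append(e)
    return {key: core_measures(group) for key, group in groups.items()}


# Made-up records for one reader working at two facilities.
sample = [
    Exam("A", "R1", "screening", 1, 1, False),
    Exam("A", "R1", "screening", 0, 2, False),
    Exam("A", "R1", "screening", 0, 4, True),
    Exam("A", "R1", "screening", 1, 1, False),
    Exam("B", "R1", "screening", 0, 4, False),
    Exam("B", "R1", "screening", 1, 1, False),
    Exam("B", "R1", "screening", 2, 2, False),
    Exam("B", "R1", "screening", 1, 1, False),
    Exam("A", "R1", "diagnostic", 5, 5, True),
    Exam("B", "R1", "diagnostic", 2, 2, False),
]

result = audit_by_reader(sample)
for key, measures in sorted(result.items()):
    print(key, measures)
```

Tracking both the initial and the final assessment is what allows a screening examination first assessed as Category 0 to be counted in PPV₂ and the cancer detection rate once biopsy is recommended. The tiny sample is for illustration only; rates computed from so few examinations would not be meaningful in a real audit.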

To encourage physicians and facilities to achieve an even higher level of performance, the Committee also recommends a voluntary advanced-level audit that would involve obtaining breast pathology reports for tumor size, grade, and lymph node status, and collecting data on patient characteristics, in addition to all tracking, measurements, and assessments in the basic required audit. In order to achieve substantive improvements in the quality of mammography interpretations, broader changes are needed that will help facilities to conduct more meaningful analyses and to use the results to improve performance and quality assurance. However, creating a system that could accurately perform an advanced medical audit with feedback is well beyond the staffing and expertise of mammography facilities. Thus, an inherent component of this advanced audit program is the creation of a centralized data and statistical coordinating center, where standardized pooled data are electronically compiled, analyzed, and reported back to participating facilities to provide the type of meaningful feedback that is given in organized screening programs in some other countries.

Statistical coordinating center staff should be qualified to collect and maintain data from disparate sources, standardize data collection procedures, conduct accurate advanced-level audits, provide feedback, and help develop, implement, and evaluate self-improvement plans for interpreting physicians or facilities that do not achieve performance benchmarks. The statistical coordinating center should also test different methods of delivery of audit results and other uses of “feedback” to improve interpretive performance. The center could increase the impact of the basic required audits as well, by developing national performance benchmarks against which facilities and individual interpreting physicians could compare their own performance. The center could also undertake studies of randomly selected facilities using required basic audit procedures to ascertain the impact of these new procedures on interpretive quality. Public release of aggregate summary data and benchmarks will benefit everyone who participates in mammography.

The Breast Cancer Surveillance Consortium (BCSC) has already developed effective procedures and guidelines for mammography data collection and has demonstrated the feasibility of such an undertaking. Given the established expertise of the BCSC, it is a model for this new endeavor. The Agency for Healthcare Research and Quality (AHRQ) is also a possible choice for this role because it routinely collects and analyzes data for quality assessment and improvement purposes. Furthermore, it is able to afford such data the necessary protection from public disclosure that is essential to facilitate collection of accurate data.

The Committee believes FDA should not collect these data. Rather, FDA’s onsite inspectors should simply verify that the data have been collected and reviewed by interpreting physicians in each mammography facility. Current regulations require that each facility designate at least one “audit interpreting physician” to analyze and review the medical outcomes audit data, report the results to other interpreting physicians, and document the nature of any follow-up actions. No change in this procedure is warranted. It is impractical to subject these data to independent verification, and regulatory oversight is unnecessary for a voluntary program. There is potential for conflict if a regulatory body also provides analysis and feedback for quality improvement.

Incentives to participate in this voluntary audit should be incorporated into the program, such as paying for quality performance, as described in the section titled The Need for a Supportive Environment to Promote Quality Improvement.

BREAST IMAGING CENTERS OF EXCELLENCE

As noted previously, centralized breast cancer screening programs with extensive interpretive quality assurance activities currently operate in several other countries, and some organized screening programs have been established in the United States as well. Although evidence is lacking to assess the impact of individual elements of these programs, when implemented together, they appear to be effective for improving interpretive performance and quality assurance. While adapting health care quality assurance practices of countries with national health care systems to the diverse and fragmented delivery of health care in the United States may not be fully feasible, the challenge is not insurmountable: such adaptation has already occurred within some integrated health plans and through the NCI Breast Cancer Surveillance Consortium. There is an urgent need to further test the concept by establishing demonstration and evaluation programs to designate and monitor the performance of Breast Imaging Centers of Excellence. These specialized Centers of Excellence could encourage a higher level of integration, performance, and quality assurance in breast cancer detection and diagnosis.

The Breast Imaging Centers of Excellence should have and test the following attributes:

High-volume mammogram interpretation
Double reading
Proficiency of interpretation demonstrated by comprehensive medical audit as defined in the BI-RADS atlas, which exceeds the voluntary advanced audit
Systematic feedback and an infrastructure linking mammography performance to patient outcomes
Patient reminder systems

In addition, pilot projects should be established within selected centers to further develop and evaluate interpretive skills assessment exams such as MISA. Centers of Excellence should also incorporate expertise in currently accepted nonmammographic imaging modalities for breast cancer diagnosis, such as ultrasound and magnetic resonance imaging, with accreditation where applicable. Centers would thus have the expertise to develop and host on-site training programs in breast imaging, for mammography as well as other imaging modalities. Programs could be tailored for initial training and CME, as well as personalized training for interpreting physicians whose medical audit results indicate that they need to improve their skills. Interpretive skills assessment exams could be administered before and after training.

By providing multidisciplinary training and work environments for diagnosing women with breast cancer, Centers could increase job satisfaction and retention of practitioners and increase the productivity and quality of all members of the breast care team; high-quality facilities will attract high-quality personnel at all professional levels.

Breast imaging centers could also potentially improve access to mammography in low-volume areas by offering centralized interpretation, either through soft-copy telemammography or by receiving films shipped from remote and/or mobile facilities. In their capacity to serve as regional readers of mammograms, Centers of Excellence should improve the ability of currently underserved populations and communities to access mammography services.

While specialized breast imaging centers that obtain the designation of Breast Imaging Center of Excellence would be expected to gain patients and referrals by reputation, additional incentives should be offered to encourage the delivery of demonstrably high-quality care. Such incentives should include higher reimbursement rates for breast imaging procedures, and eligibility to test a no-fault medical liability insurance program, discussed in Chapter 5. In the absence of such incentives, existing breast imaging centers are likely to view the extra cost and effort required for designation as a Center of Excellence as an unnecessary burden, thereby limiting the number of facilities that participate.

Integrating Breast Imaging Centers of Excellence into Interdisciplinary Breast Care

Ideally, Breast Imaging Centers of Excellence will be linked with facilities that offer comprehensive and multidisciplinary treatment and support for breast cancer (Box 2-2). The best among such facilities feature interdisciplinary care based on ongoing communication and collaboration among the multiple disciplines involved in diagnosing and treating cancer (Rabinowitz, 2004). This approach to disease management is intended to optimize the broad range of diagnostic techniques and therapies now available to address breast cancer, as well as other diseases (August et al., 1993; Kolb, 2000; Rabinowitz, 2004).

BOX 2–2
Models of Integrated Breast Care

During the past three decades, a growing recognition of the complexity of treating breast disease and the need for coordination among the many contributors to that process have led to the development of breast cancer clinics, mammography centers, and comprehensive breast cancer centers. Although diverse in scope and setting, such “clinic-based” organizations have sought several common goals: to decrease morbidity, mortality, and anxiety associated with breast cancer detection and treatment; to increase coordination and communication among patients, multiple professionals, departments, and health care facilities; to foster participation in behavioral and translational research; and to define, measure, and monitor quality determinants of clinical, operational, and financial success for the sponsoring organization.

An interdisciplinary approach to breast cancer detection and treatment may be especially important in the United States, with its fragmented and specialized delivery of medical care. Initially, this led to the development of freestanding, comprehensive breast care programs such as the Van Nuys (California) Breast Center, founded in 1979. Advances in information technology have since permitted the development of “breast centers without walls” that allow clinicians involved in separate practices and locations to collaborate effectively on the care of individual patients.

In 2000, the European Society of Mastology (EUSOMA) published a paper defining the requirements of a “specialist breast unit” and establishing critical standards for such units, including the size of the patient population, the type and qualifications of its personnel, and a scope of care encompassing all stages of breast disease and evaluated through a common quality assurance system and database. According to these guidelines, which were intended to unify and standardize heterogeneous European breast care programs (Mansel, 2000), each member of a breast unit’s core team (consisting of a clinical director, surgeons, radiologists, a pathologist, nurses, an oncologist, diagnostic radiographers, and a data manager) is required to have advanced skills, obtained by spending a year in a specialized training unit. While the EUSOMA model continues to be discussed and refined, it remains to be implemented across Europe, and large variations in breast cancer service delivery persist among European countries.

SOURCES: Rabinowitz (2000, 2004); Kolb (2000); Coleman (2005); Silverstein (1973, 2000); Coleman and Lebovic (1996); EUSOMA (2000); de Wolf (2003); Multidisciplinary coordination (2003).

A conceptual framework for improving health outcomes for cancer patients, Quality in the Continuum of Cancer Care (QCCC), recognizes the critical importance of the steps in the process of cancer care from prevention, screening, diagnosis, and treatment to end-of-life care. The implication of this conceptualization is that the transitions between being at risk in the population and coming in for screening, or having an abnormal test and getting treated, are as important as each of the steps (Zapka et al., 2003). Failures in the transitions are associated with later-stage cancer occurrence (Sung et al., 2000; Yankaskas et al., 2005). By focusing on the steps and transitions in care where failures can occur, the QCCC framework aims to facilitate more organized systems of interdisciplinary medical practice that improve care, and establish meaningful measures of quality that promote improved outcomes.

Some reports suggest that interdisciplinary breast care may facilitate timelier treatment as well as less invasive surgery and better patient satisfaction (Rabinowitz, 2004). Higher rates of breast-conserving surgery and lower rates of false-negative breast biopsies have been observed in high-volume, specialized settings (Chang et al., 2001; Smith-Bindman et al., 2003; Tripathy, 2004); however, it remains to be determined whether care in such facilities is associated with improved rates of recurrence or survival (Tripathy, 2004). More specifically, advocates of interdisciplinary breast cancer care stress its advantages in promoting communication between radiologists and pathologists, particularly regarding prospective treatment planning (Rabinowitz, 2004).

The future expansion of interdisciplinary breast cancer care programs is expected to emphasize participation in clinical trials, research and research training, and the use of emerging technologies to promote information sharing and to facilitate the transition between care in urban-based cancer centers and physicians serving medically underserved populations (Garcia, 2004). Such improvements could equally enhance Breast Imaging Centers of Excellence.

THE NEED FOR A SUPPORTIVE ENVIRONMENT TO PROMOTE QUALITY IMPROVEMENT

Increased regulations alone may cause facilities to focus on meeting mandatory minimum requirements rather than motivating them to strive for maximal quality assurance and improvement that could additionally benefit the public’s health. Supportive elements to help improve interpretive performance may not be as easily implemented as regulations, but in their absence, new requirements may not produce meaningful improvements and would be viewed primarily as an added burden by mammography facilities and personnel. Mammography is already one of the most highly regulated medical procedures in the country, so if radiologists believe that additional regulations and oversight with respect to interpretation are merely punitive and burdensome, they may find other areas of practice more appealing.

Many countries with organized breast cancer screening programs have implemented extensive quality assurance procedures that require additional resources for that purpose (Klabunde et al., 2001). For example, most have designated staff with special training for data quality assurance. Furthermore, improvements in interpretive quality within programs such as the UK National Screening Programme have been highly dependent on a supportive environment and on funds specifically designated for quality assurance (Perry, 2004b). The workload and costs associated with meeting MQSA requirements are significant, and the new audit procedures proposed here will add to the workload and expense of adhering to MQSA requirements. However, historically these costs have not been factored into reimbursement, placing a considerable financial burden on facilities. Hence, the Centers for Medicare and Medicaid Services (CMS) and other health care payers must account for the cost of adhering to federally mandated audit procedures when setting reimbursement rates for mammography. Adequate reimbursement for MQSA compliance will be essential to maintain women’s access to mammography services.

In the United States, some disincentives to practice mammography already exist (see Chapter 4 for more detail). Thus, the Committee believes that it is essential to provide positive incentives for facilities and individual interpreting physicians to aim for the highest level of quality assurance. Participation in the voluntary advanced audit process will likely lead to a higher quality of performance and care, but it will also entail a considerable increase in workload and paperwork, so a supportive approach is critical. Past experience with voluntary mammography accreditation suggests that participation in new voluntary quality assurance activities will be limited in the absence of incentives.

One strong incentive would be paying for quality. A number of large health insurers have recently initiated “pay for quality” (PFQ) programs (see Box 2-3). CMS is also developing a PFQ policy, and private health insurers often follow the lead of CMS. The extensive quality assurance procedures proposed for the voluntary advanced audit justify the use of such an approach for mammography. Eligibility to participate in such a program should depend on documentation that the facility meets specific performance criteria. Although general guidelines for performance have been put forth previously (Bassett et al., 1994), the performance criteria for this specific program are different, and they should be determined and periodically updated by an informed group of experts and patient advocates.

BOX 2–3
Paying for Quality

The National Committee for Quality Assurance (NCQA) report The State of Health Care Quality discusses at length the quality gaps present in our existing health care system. The quality gap between the top 10 percent of health care plans and the national average was estimated at 20 percent across the medical field, far higher than in most other industries. For example, in the airline industry, the quality gap in terms of safety is less than 1 percent. Providing incentives for physicians striving to improve quality is an option receiving significant attention, despite considerable obstacles to implementation. The most effective “pay for quality” (PFQ) programs in operation today encourage cooperation among health care plans, providers, consumers, and patients, potentially advancing our health care system on both the whole-system and individual levels.

Two core principles guide PFQ programs: a common set of metrics used to assess performance, and funding to support performance improvement. Ensuring the success of PFQ programs requires that funding is aligned with program goals. Payment should reward effective care and encourage the development of effective care delivery systems. PFQ programs can, in theory, have far-reaching benefits. Participating health plans could profit from increased provider quality and consumer satisfaction. Improvements in medical care should result in a healthier general workforce, benefiting employers. Individual physicians receive direct feedback on performance, while physician groups actively improving quality receive direct monetary rewards. Moreover, care centers that achieve a higher level of quality assurance will likely encounter indirect rewards through increased patient enrollment.

A crucial step in improving quality care is utilization of information technology. Electronic patient records enhance access to patient information and allow the use of reminder systems. Although implementing electronic record systems can prove costly, with some estimates of $10,000 to $20,000 per physician per year, a 5-year cost-benefit analysis found a net gain of $86,400 per provider. Despite this potential profit, only 7 percent of U.S. physicians use electronic records. PFQ programs can provide the incentive for physician groups to take on this burdensome, yet valuable task.

Currently, physicians are reimbursed for basic care through fee-for-service programs, capitation, or salary rewards, yet each is inadequate for a true PFQ program. Fee-for-service does not provide incentives to improve the quality of care delivery, while salaries hinder innovation and may reduce productivity. Capitation does not provide reimbursement for additional expenses accrued to improve quality. Better methods of reimbursement are necessary to properly reward quality and promote improvement.

Advocates believe that orienting treatment groups around a common clinical purpose could facilitate quality improvement. Overall, the prevalence of PFQ programs has increased, and they vary in method and scope. One example is the “Bridges to Excellence” program, funded by General Electric, which established a partnership that includes Partners HealthCare, Tufts Health Plan, and several other Massachusetts health groups. Physician groups pursuing improvement systems and meeting specific standards of care, including electronic records and disease registries, receive up to $55 per patient per year. Anthem Blue Cross and Blue Shield is also offering programs in several locations, including New Hampshire and Michigan. This program uses the Health Plan Employer Data and Information Set (HEDIS) quality measures published by NCQA to assess and reward physicians providing excellent care; practices received up to $12,062 in 2002. The most advanced PFQ program to date, California’s Integrated Healthcare Association (IHA), involves six major health plans. Assessment includes clinical measures, patient satisfaction, and information technology/infrastructure investment.

The Centers for Medicare and Medicaid Services (CMS) is currently in the design and development phase of the Healthcare Quality Demonstration, established by the Medicare Modernization Act of 2003 and set to begin January 1, 2006. The program will investigate shared decision making for patient-centered care, redesign of care networks to focus on outcome improvement, methods to reduce practice variation, and quality incentives. The Physician Group Practice Demonstration combines fee-for-service payments with incentive programs to reward group practices for financial performance and quality improvement. Additionally, CMS is funding a study at the Institute of Medicine, titled “Redesigning Health Insurance Benefits, Payment, and Performance Improvement Programs,” which focuses, in part, on how improvements in quality and care delivery can be rewarded.

SOURCES: CMS (2004); Epstein et al. (2004); Integrated Healthcare Association (2004); National Committee for Quality Assurance (2004); Casalino et al. (2003); Wang et al. (2003); DiSalvo et al. (2001); Deloitte & Touche (2000); IOM (2001a); Personal correspondence, J. Pila, Centers for Medicare and Medicaid Services (September 29, 2004).

SUMMARY AND CONCLUSIONS

The effectiveness of mammography greatly depends on the quality of image interpretation, but reading mammograms and assessing interpretive performance are both quite challenging. The Committee concludes that current medical audit requirements under MQSA fail to fulfill a meaningful function for measuring or improving the quality of image interpretation. Medical audits should be designed to link practice patterns to patient outcomes in a way that can influence provider performance. Interpreting physicians need to understand their current level of performance before they can take action to improve their interpretive accuracy.

Thus, the medical audit required under MQSA should be enhanced and standardized to include the calculation of three core measures, calculated separately for screening and diagnostic mammograms. In addition, the group of women that facilities are required to track should be expanded to include not only women with BI-RADS 4 and 5 assessments, but also all women for whom additional imaging is recommended, until a final assessment is rendered.

Facilities should also be encouraged to strive for a higher level of quality performance through participation in two voluntary programs. These programs should be given a high priority so that more specific recommendations about monitoring and feedback requirements can be established in the future. Mandating these approaches for all mammography facilities in the United States is not feasible at present because of the fragmented nature of health care delivery, but experience with them on a voluntary basis eventually could lead to higher standards for all facilities. First, a voluntary advanced audit should include the collection of tumor staging information from pathology reports and collection of patient characteristics in addition to all tracking, measurements, and assessments in the basic required audit. This should be facilitated by the formation of a data and statistical coordinating center that would collect the data and conduct accurate and standardized advanced-level audits, provide feedback, and help develop, implement, and evaluate improvement plans. The BCSC and AHRQ both have characteristics that make them viable options for undertaking the endeavor.

Second, the establishment of Breast Imaging Centers of Excellence could encourage U.S. facilities in diverse settings to adopt—and adapt—features of successful foreign programs. Evidence is insufficient to assess the impact of individual components of these programs on performance, but when implemented together, with systematic feedback, they appear to be effective for improving quality. This undertaking will likely require cooperative efforts of several organizations, perhaps to include NCI, CMS, and AHRQ.

A supportive environment is also essential for improving interpretive performance, as demonstrated by high-quality programs in other countries. The expanded audit procedures proposed here will increase the cost of compliance. Thus, reimbursement rates for mammography should be reformulated to account for the costs of complying with federally mandated regulations. In addition, developing a “pay for quality” program to reward high levels of performance and quality assurance would be a strong incentive to participate in the advanced-level audit and to seek designation as a Breast Imaging Center of Excellence. Testing an alternative “no fault” approach to medical liability insurance that is linked to high performance standards (described in detail in Chapter 5) could provide an additional incentive to seek designation as a Center of Excellence.

The Committee also considered a number of other approaches that could potentially improve interpretive performance, such as double reading, use of CAD, increased continuing experience (interpretive volume) requirements, and CME programs that focus on interpretation and self-assessment. While there is some evidence to suggest that these approaches could also improve the quality of mammography interpretation, the data available to date are insufficient to justify changes to MQSA legislation or regulations. However, the Committee recommends that additional studies be rapidly undertaken to develop a stronger evidence base for the effects of CME, reader volume, double reading, and CAD on interpretive performance.

REFERENCES

Adcock KA. 2004. Initiative to improve mammogram interpretation. The Permanente Journal 8(2):12–18.

American College of Radiology. 2003. ACR BI-RADS®—Mammography. In: ACR Breast Imaging Reporting and Data System, Breast Imaging Atlas. 4th ed. Reston, VA: American College of Radiology.

Andersson I, Aspegren K, Janzon L, Landberg T, Lindholm K, Linell F, Ljungberg O, Ranstam J, Sigfusson B. 1988. Mammographic screening and mortality from breast cancer: The Malmo Mammographic Screening Trial. British Medical Journal 297(6654):943–948.

Anttinen I, Pamilo M, Soiva M, Roiha M. 1993. Double reading of mammography screening films—one radiologist or two? Clinical Radiology 48(6):414–421.

Applied Vision Research Institute. 2004. PERFORMS: SA2003 Report to the National Coordinating Committee for QA Radiologists. Derby, England: University of Derby.

August DA, Carpenter LC, Harness JK, Delosh T, Cody RL, Adler DD, Oberman H, Wilkins E, Schottenfeld D, McNeely SG. 1993. Benefits of a multidisciplinary approach to breast care. Journal of Surgical Oncology 53(3):161–167.

Baker JA, Rosen EL, Lo JY, Gimenez EI, Walsh R, Soo MS. 2003. Computer-aided detection (CAD) in screening mammography: Sensitivity of commercial CAD systems for detecting architectural distortion. American Journal of Roentgenology 181(4):1083–1088.

Ballard-Barbash R, Taplin SH, Yankaskas BC, Ernster VL, Rosenberg RD, Carney PA, Barlow WE, Geller BM, Kerlikowske K, Edwards BK, Lynch CF, Urban N, Chrvala CA, Key CR, Poplack SP, Worden JK, Kessler LG. 1997. Breast Cancer Surveillance Consortium: A national mammography screening and outcomes database. American Journal of Roentgenology 169(4):1001–1008.

Barlow WE, Chi C, Carney PA, Taplin SH, D’Orsi C, Cutter G, Hendrick RE, Elmore JG. 2004. Accuracy of screening mammography interpretation by characteristics of radiologists. Journal of the National Cancer Institute 96(24):1840–1850.

Barlow WE, Lehman CD, Zheng Y, Ballard-Barbash R, Yankaskas BC, Cutter GR, Carney PA, Geller BM, Rosenberg R, Kerlikowske K, Weaver DL, Taplin SH. 2002. Performance of diagnostic mammography for women with signs or symptoms of breast cancer. Journal of the National Cancer Institute 94(15):1151–1159.

Barton MB, Morley DS, Moore S, Allen JD, Kleinman KP, Emmons KM, Fletcher SW. 2004. Decreasing women’s anxieties after abnormal mammograms: A controlled trial. Journal of the National Cancer Institute 96(7):529–538.

Bassett LW, Hendrick R, Bassford T, Butler PF, Carter D, DeBor M, D’Orsi CJ, Garlinghouse CJ, Jones RF, Langer AS, Lichtenfeld JL, Osuch JR, Reynolds LN, deParedes ES, Williams RE. 1994. Quality determinants of mammography. Clinical Practice Guideline No. 13. AHCPR Publication No. 95–0632. Rockville, MD: Agency for Health Care Policy and Research.

Beam CA, Conant EF, Sickles EA. 2002. Factors affecting radiologist inconsistency in screening mammography. Academic Radiology 9(5):531–540.

Beam CA, Conant EF, Sickles EA. 2003. Association of volume and volume-independent factors with accuracy in screening mammogram interpretation. Journal of the National Cancer Institute 95(4):282–290.

Beam CA, Layde PM, Sullivan DC. 1996. Variability in the interpretation of screening mammograms by U.S. radiologists: Findings from a national sample. Archives of Internal Medicine 156(2):209–213.

Bennett NL, Davis DA, Easterling WE, Friedmann P, Green JS, Koeppen BM, Mazmanian PE, Waxman HS. 2000. Continuing medical education: A new vision of the professional development of physicians. Academic Medicine 75(12):1167–1172.

Berg WA, D’Orsi CJ, Jackson VP, Bassett LW, Beam CA, Lewis RS, Crewson PE. 2002. Does training in the Breast Imaging Reporting and Data System (BI-RADS) improve biopsy recommendations or feature analysis agreement with experienced breast imagers at mammography? Radiology 224(3):871–880.

Berlin L. 2003. Breast cancer, mammography, and malpractice litigation: The controversies continue. American Journal of Roentgenology 180(5):1229–1237.

Brem RF, Schoonjans JM. 2001. Radiologist detection of microcalcifications with and without computer-aided detection: A comparative study. Clinical Radiology 56(2):150–154.

Brett J, Austoker J. 2001. Women who are recalled for further investigation for breast screening: Psychological consequences 3 years after recall and factors affecting re-attendance. Journal of Public Health Medicine 23(4):292–300.

Brown ML, Fintor L. 1995. U.S. screening mammography services with mobile units: Results from the National Survey of Mammography Facilities. Radiology 195(2):529–532.

Brown ML, Houn F, Sickles EA, Kessler LG. 1995. Screening mammography in community practice: Positive predictive value of abnormal findings and yield of follow-up diagnostic procedures. American Journal of Roentgenology 165(6):1373–1377.

Buist DS, Porter PL, Lehman C, Taplin SH, White E. 2004. Factors contributing to mammography failure in women aged 40–49 years. Journal of the National Cancer Institute 96(19):1432–1440.

Byrne C. 1997. Studying mammographic density: Implications for understanding breast cancer. Journal of the National Cancer Institute 89(8):531–533.

Carney PA, Geller BM, Moffett H, Ganger M, Sewell M, Barlow WE, Stalnaker N, Taplin SH, Sisk C, Ernster VL, Wilkie HA, Yankaskas B, Poplack SP, Urban N, West MM, Rosenberg RD, Michael S, Mercurio TD, Ballard-Barbash R. 2000. Current medicolegal and confidentiality issues in large, multicenter research programs. American Journal of Epidemiology 152(4):371–378.

Carney PA, Miglioretti DL, Yankaskas BC, Kerlikowske K, Rosenberg R, Rutter CM, Geller BM, Abraham LA, Taplin SH, Dignan M, Cutter G, Ballard-Barbash R. 2003. Individual and combined effects of age, breast density, and hormone replacement therapy use on the accuracy of screening mammography. Annals of Internal Medicine 138(3):168–175.

Casalino L, Gillies RR, Shortell SM, Schmittdiel JA, Bodenheimer T, Robinson JC, Rundall T, Oswald N, Schauffler H, Wang MC. 2003. External incentives, information technology, and organized processes to improve health care quality for patients with chronic diseases. JAMA 289(4):434–441.

Cave DG. 1995. Profiling physician practice patterns using diagnostic episode clusters. Medical Care 33(5):463–486.

Centers for Medicare and Medicaid Services. 2004. Physician Group Practice Demonstration. [Online]. Available: http://www.cms.hhs.gov/researchers/demos/pgpdemo.asp? [accessed September 29, 2004].

Chang JH, Vines E, Bertsch H, Fraker DL, Czerniecki BJ, Rosato EF, Lawton T, Conant EF, Orel SG, Schuchter L, Fox KR, Zieber N, Glick JH, Solin LJ. 2001. The impact of a multidisciplinary breast cancer center on recommendations for patient management: The University of Pennsylvania experience. Cancer 91(7):1231–1237.

Coleman C. 2005. The breast cancer clinic: Yesterday, today, and tomorrow. In: Buchsel PC, Yarbro CH, eds. Oncology Nursing in the Ambulatory Setting. Sudbury, MA: Jones and Bartlett Publishers. Pp. 231–245.

Coleman C, Lebovic GS. 1996. Organizing a comprehensive breast center. In: Harris JR, Lippman ME, Morrow M, eds. Diseases of the Breast. Philadelphia, PA: Lippincott-Raven Publishers. Pp. 963–970.

Colorado Mammography Project. 2003. Colorado Mammography Project: Data. [Online]. Available: http://cmap.cooperinstden.org/data.htm [accessed December 16, 2004].

Davis D, O’Brien MA, Freemantle N, Wolf F, Mazmanian P, Taylor-Vaisey A. 1999. Impact of formal continuing medical education: Do conferences, workshops, rounds, and other traditional continuing education activities change physician behavior or health care outcomes? JAMA 282(9):867–874.

Davis DA, Thomson MA, Oxman AD, Haynes RB. 1992. Evidence for the effectiveness of CME. A review of 50 randomized controlled trials. JAMA 268(9):1111–1117.

Davis DA, Thomson MA, Oxman AD, Haynes RB. 1995. Changing physician performance. A systematic review of the effect of continuing medical education strategies. JAMA 274(9):700–705.

De Bruhl ND, Bassett LW, Jessop NW, Mason AM. 1996. Mobile mammography: Results of a national survey. Radiology 201(2):433–437.

de Wolf C. 2003. The Need for EU Guidelines for Multidisciplinary Breast Care. Presentation at the meeting of the European Parliament, February 17, 2003, Brussels, Belgium. [Online]. Available: http://www.europarl.eu.int/workshop/breast_cancer/docs/de_wolf_en.pdf [accessed November 4, 2004].

Dee KE, Sickles EA. 2001. Medical audit of diagnostic mammography examinations: Comparison with screening outcomes obtained concurrently. American Journal of Roentgenology 176(3):729–733.

Deloitte & Touche. 2000. Taking the Pulse: Physicians and the Internet. New York: Deloitte & Touche.

Destounis SV, DiNitto P, Logan-Young W, Bonaccio E, Zuley ML, Willison KM. 2004. Can computer-aided detection with double reading of screening mammograms help decrease the false-negative rate? Initial experience. Radiology 232(2):578–584.

Dinnes J, Moss S, Melia J, Blanks R, Song F, Kleijnen J. 2001. Effectiveness and cost-effectiveness of double reading of mammograms in breast cancer screening: Findings of a systematic review. Breast 10(6):455–463.

DiSalvo TG, Normand SL, Hauptman PJ, Guadagnoli E, Palmer RH, McNeil BJ. 2001. Pitfalls in assessing the quality of care for patients with cardiovascular disease. American Journal of Medicine 111(4):297–303.

Duijm LE, Groenewoud JH, Hendriks JH, de Koning HJ. 2004. Independent double reading of screening mammograms in the Netherlands: Effect of arbitration following reader disagreements. Radiology 231(2):564–570.

Egger JR, Cutter GR, Carney PA, Taplin SH, Barlow WE, Hendrick RE, D’Orsi CJ, Fosse JS, Abraham L, Elmore JG. In press. Mammographers’ perception of women’s breast cancer risk. Medical Decision Making.

Egglin TK, Feinstein AR. 1996. Context bias. A problem in diagnostic radiology. JAMA 276(21):1752–1755.

Elmore JG, Armstrong K, Lehman CD, Fletcher SW. 2005. Screening for breast cancer. JAMA 293(10):1245–1256.

Elmore JG, Feinstein AR. 1992. A bibliography of publications on observer variability (final installment). Journal of Clinical Epidemiology 45(6):567–580.

Elmore JG, Miglioretti DL, Reisch LM, Barton MB, Kreuter W, Christiansen CL, Fletcher SW. 2002. Screening mammograms by community radiologists: Variability in false-positive rates. Journal of the National Cancer Institute 94(18):1373–1380.

Elmore JG, Nakano CY, Koepsell TD, Desnick LM, D’Orsi CJ, Ransohoff DF. 2003. International variation in screening mammography interpretations in community-based programs. Journal of the National Cancer Institute 95(18):1384–1393.

Elmore JG, Taplin S, Barlow WE, Cutter G, D’Orsi C, Hendrick RE, Abraham L, Fosse J, Carney PA. In press. Community radiologists’ medical malpractice experience, concerns, and interpretive performance. Radiology.

Elmore JG, Wells CK, Howard DH, Feinstein AR. 1997. The impact of clinical history on mammographic interpretations. JAMA 277(1):49–52.

Elmore JG, Wells CK, Howard DH. 1998. Does diagnostic accuracy in mammography depend on radiologists’ experience? Journal of Women’s Health 7(4):443–449.

Elmore JG, Wells CK, Lee CH, Howard DH, Feinstein AR. 1994. Variability in radiologists’ interpretations of mammograms. New England Journal of Medicine 331(22):1493–1499.

Elwood JM, Cox B, Richardson AK. 1993. The effectiveness of breast cancer screening by mammography in younger women. Online Journal of Current Clinical Trials. Doc. No. 32.

Epstein AM, Lee TH, Hamel M. 2004. Paying physicians for high-quality care. New England Journal of Medicine 350(18):1910–1912.

Ernster VL, Ballard-Barbash R, Barlow WE, Zheng Y, Weaver DL, Cutter G, Yankaskas BC, Rosenberg R, Carney PA, Kerlikowske K, Taplin SH, Urban N, Geller BM. 2002. Detection of ductal carcinoma in situ in women undergoing screening mammography. Journal of the National Cancer Institute 94(20):1546–1554.

Esserman L, Cowley H, Eberle C, Kirkpatrick A, Chang S, Berbaum K, Gale A. 2002. Improving the accuracy of mammography: Volume and outcome relationships. Journal of the National Cancer Institute 94(5):369–375.

European Society of Mastology (EUSOMA). 2000. The requirements of a specialist breast unit. European Journal of Cancer 36(18):2288–2293.

FDA (U.S. Food and Drug Administration). 1997. Quality Mammography Standards; Final Rule (Preamble). 21 C.F.R. Parts 16 and 900.

FDA. 2004. HIPAA and Release of Information for MQSA Purposes. [Online]. Available: http://www.fda.gov/cdrh/mammography/mqsa-rev.html#HIPPA [accessed October 15, 2004].

Feig SA, Hall FM, Ikeda DM, Mendelson EB, Rubin EC, Segel MC, Watson AB, Eklund GW, Stelling CB, Jackson VP. 2000. Society of Breast Imaging residency and fellowship training curriculum. Radiologic Clinics of North America 38(4):xi, 915–920.

Feig SA, Sickles EA, Evans WP, Linver MN. 2004. Re: Changes in breast cancer detection and mammography recall rates after the introduction of a computer-aided detection system. Journal of the National Cancer Institute 96(16):1260–1261; author reply, 1261.

Feinstein AR. 1985. A bibliography of publications on observer variability. Journal of Chronic Diseases 38(8):619–632.

Fletcher SW, Black W, Harris R, Rimer BK, Shapiro S. 1993. Report of the International Workshop on Screening for Breast Cancer. Journal of the National Cancer Institute 85(20):1644–1656.

Fletcher SW, Elmore JG. 2003. Clinical practice. Mammographic screening for breast cancer. New England Journal of Medicine 348(17):1672–1680.

Frankel SD, Sickles EA, Curpen BN, Sollitto RA, Ominsky SH, Galvin HB. 1995. Initial versus subsequent screening mammography: Comparison of findings and their prognostic significance. American Journal of Roentgenology 164(5):1107–1109.

Freer TW, Ulissey MJ. 2001. Screening mammography with computer-aided detection: Prospective study of 12,860 patients in a community breast center. Radiology 220(3):781–786.

Frisell J, Eklund G, Hellstrom L, Lidbrink E, Rutqvist LE, Somell A. 1991. Randomized study of mammography screening—preliminary report on mortality in the Stockholm trial. Breast Cancer Research & Treatment 18(1):49–56.

Gale AG. 2003. PERFORMS: A self-assessment scheme for radiologists in breast screening. Seminars in Breast Disease 6(3):148–152.

Garcia R. 2004. Interdisciplinary breast cancer care: Declaring and improving the standard. Review. Oncology (Huntington) 18(10):1268–1270.

Geller BM, Barlow WE, Ballard-Barbash R, Ernster VL, Yankaskas BC, Sickles EA, Carney PA, Dignan MB, Rosenberg RD, Urban N, Zheng Y, Taplin SH. 2002. Use of the American College of Radiology BI-RADS to report on the mammographic evaluation of women with signs and symptoms of breast disease. Radiology 222(2):536–542.

Ghate SV, Soo MS, Baker JA, Walsh R, Gimenez EI, Rosen EL. 2005. Comparison of recall and cancer detection rates for immediate versus batch interpretation of screening mammograms. Radiology 235(1):31–35.

Gigerenzer G. 2002. Calculated Risks: How to Know When Numbers Deceive You. New York: Simon & Schuster.

Greenfield S, Kaplan SH, Kahn R, Ninomiya J, Griffith JL. 2002. Profiling care provided by different groups of physicians: Effects of patient case-mix (bias) and physician-level clustering on quality assessment results. Annals of Internal Medicine 136(2):111–121.

Gunn PP, Fremont AM, Bottrell M, Shugarman LR, Galegher J, Bikson T. 2004. The Health Insurance Portability and Accountability Act Privacy Rule: A practical guide for researchers. Medical Care 42(4):321–327.

Gur D, Sumkin JH, Rockette HE, Ganott M, Hakim C, Hardesty L, Poller WR, Shah R, Wallace L. 2004. Changes in breast cancer detection and mammography recall rates after the introduction of a computer-aided detection system. Journal of the National Cancer Institute 96(3):185–190.

Harvey SC, Geller B, Oppenheimer RG, Pinet M, Riddell L, Garra B. 2003. Increase in cancer detection and recall rates with independent double interpretation of screening mammography. American Journal of Roentgenology 180(5):1461–1467.

Hendee WR, Patton JA, Simmons G. 1999. A hospital-employed physicist working in radiology should provide training to nonradiologists wishing to offer imaging services. Medical Physics 26(6):859–861.

Hendrick RE, Cutter GR, Berns EA, Nakano C, Egger J, Carney PA, Abraham L, Taplin SH, D’Orsi CJ, Barlow W, Elmore JG. 2005. Community-based mammography practice: Services, charges, and interpretation methods. American Journal of Roentgenology 184(2):433–438.

Henson RM, Wyatt SW, Lee NC. 1996. The National Breast and Cervical Cancer Early Detection Program: A comprehensive public health response to two major health issues for women. Journal of Public Health Management & Practice 2(2):36–47.

Hofvind S, Thoresen S, Tretli S. 2004. The cumulative risk of a false-positive recall in the Norwegian Breast Cancer Screening Program. Cancer 101(7):1501–1507.

Hulka CA, Slanetz PJ, Halpern EF, Hall DA, McCarthy KA, Moore R, Boutin S, Kopans DB. 1997. Patients’ opinion of mammography screening services: Immediate results versus delayed results due to interpretation by two observers. American Journal of Roentgenology 168(4):1085–1089.

Hutton B, Bradt E, Chen J, Gobrecht P, O’Connell J, Pedulla A, Signorelli T, Bisner S, Hoffman D, Lawson H. 2004. Breast cancer: Screening data for assessing quality of services: New York, 2000–2003. Morbidity & Mortality Weekly Report 53(21):455–457.

Integrated Healthcare Association. 2004. IHA “Pay For Performance”. [Online]. Available: http://www.iha.org/Ihaproj.htm [accessed September 29, 2004].

IOM (Institute of Medicine). 2001a. Crossing the Quality Chasm: A New Health System for the 21st Century. Washington, DC: National Academy Press.

IOM. 2001b. Interpreting the Volume-Outcome Relationship in the Context of Cancer Care. Washington, DC: National Academy Press.

IOM. 2005. Saving Women’s Lives: Strategies for Improving Breast Cancer Detection and Diagnosis. Washington, DC: The National Academies Press.

Jamtvedt G, Young JM, Kristoffersen DT, Thomson O’Brien MA, Oxman AD. 2003. Audit and feedback: Effects on professional practice and health care outcomes [Update of Cochrane Database Syst Rev. 2000;(2)]. Cochrane Database of Systematic Reviews (3):CD000259.

Kan L, Olivotto IA, Warren Burhenne LJ, Sickles EA, Coldman AJ. 2000. Standardized abnormal interpretation and cancer detection ratios to assess reading volume and reader performance in a breast screening program. Radiology 215(2):563–567.

Karssemeijer N, Otten JDM, Verbeek ALM, Groenewoud JH, de Koning HJ, Hendriks JHCL, Holland R. 2003. Computer-aided detection versus independent double reading of masses on mammograms. Radiology 227(1):192–200.

Kerlikowske K, Carney PA, Geller B, Mandelson MT, Taplin SH, Malvin K, Ernster V, Urban N, Cutter G, Rosenberg R, Ballard-Barbash R. 2000. Performance of screening mammography among women with and without a first-degree relative with breast cancer. Annals of Internal Medicine 133(11):855–863.

Kerlikowske K, Grady D, Barclay J, Frankel SD, Ominsky SH, Sickles EA, Ernster V. 1998. Variability and accuracy in mammographic interpretation using the American College of Radiology Breast Imaging Reporting and Data System. Journal of the National Cancer Institute 90(23):1801–1809.

Kerlikowske K, Smith-Bindman R, Ljung BM, Grady D. 2003. Evaluation of abnormal mammography results and palpable breast abnormalities. Annals of Internal Medicine 139(4):274–284.

Khuri SF, Daley J, Henderson WG. 2002. The comparative assessment and improvement of quality of surgical care in the Department of Veterans Affairs. Archives of Surgery 137(1):20–27.

Kiefe CI, Allison JJ, Williams OD, Person SD, Weaver MT, Weissman NW. 2001. Improving quality improvement using achievable benchmarks for physician feedback: A randomized controlled trial. JAMA 285(22):2871–2879.

Klabunde CN, Sancho-Garnier H, Broeders M, Thoresen S, Rodrigues VJL, Ballard-Barbash R. 2001. Quality assurance for screening mammography data collection systems in 22 countries. International Journal of Technology Assessment in Health Care 17(4):528–541.

Kolb GR. 2000. Disease management is the future: Breast cancer is the model. Surgical Oncology Clinics of North America 9(2):217–232.

Kossoff M, Brothers L, Cawson J, Osborne J, Wylie E. 2003. BreastScreen Australia: How we handle variability in interpretive skills. Seminars in Breast Disease 6(3):123–127.

Landon BE, Normand SL, Blumenthal D, Daley J. 2003. Physician clinical performance assessment: Prospects and barriers. JAMA 290(9):1183–1189.

Laya MB, Larson EB, Taplin SH, White E. 1996. Effect of estrogen replacement therapy on the specificity and sensitivity of screening mammography. Journal of the National Cancer Institute 88(10):643–649.

Linver MN, Paster SB, Rosenberg RD, Key CR, Stidley CA, King WV. 1992. Improvement in mammography interpretation skills in a community radiology practice after dedicated teaching courses: 2-year medical audit of 38,633 cases. Radiology 184(1):39–43.

Litherland JC, Evans AJ, Wilson AR. 1997. The effect of hormone replacement therapy on recall rate in the National Health Service Breast Screening Programme. Clinical Radiology 52(4):276–279.

Mandelson MT, Oestreicher N, Porter PL, White D, Finder CA, Taplin SH, White E. 2000. Breast density as a predictor of mammographic detection: Comparison of interval- and screen-detected cancers. Journal of the National Cancer Institute 92(13):1081–1087.

Mansel RE. 2000. Should specialist breast units be adopted in Europe? A comment from Europe. European Journal of Cancer 36(18):2286–2287.

Mazmanian PE, Davis DA. 2002. Continuing medical education and the physician as a learner: Guide to the evidence. JAMA 288(9):1057–1060.

Meyer JE, Eberlein TJ, Stomper PC, Sonnenfeld MR. 1990. Biopsy of occult breast lesions. Analysis of 1261 abnormalities. JAMA 263(17):2341–2343.

Monsees BS, Destouet JM. 1992. A screening mammography program. Staying alive and making it work. Radiologic Clinics of North America 30(1):211–219.

Multidisciplinary coordination expedites care, builds volumes. 2003 (October 3). Oncology Watch.

National Committee for Quality Assurance. 2004. The State of Health Care Quality. Washington, DC: National Committee for Quality Assurance.

National Consortium of Breast Centers, Inc. 2004. Quality: What Do YOU Mean By “Quality”? [Online]. Available: http://www.breastcare.org [accessed December 10, 2004].

National Health Service. 2003. NHS Breast Screening Programme Annual Review 2003. NHS Breast Cancer Screening Programmes, Sheffield, United Kingdom.

National Radiographers Quality Assurance Coordinating Group. 2000. Quality Assurance Guidelines for Radiographers. 2nd ed. Publication No. 30. Sheffield, UK: NHSBSP Publications.

Newstead GM, Schmidt RA, Chambliss J, Kral ML, Edwards S, Nishikawa RM. 2003. Are radiology residents adequately trained in screening mammography? Comparison of radiology resident performance with that of general radiologists in a simulated screening exercise. [Abstract]. Radiology 229:405.

Nodine CF, Kundel HL, Mello-Thoms C, Weinstein SP, Orel SG, Sullivan DC, Conant EF. 1999. How experience and training influence mammography expertise. Academic Radiology 6(10):575–585.

Nystrom L, Rutqvist LE, Wall S, Lindgren A, Lindqvist M, Ryden S, Andersson I, Bjurstam N, Fagerberg G, Frisell J. 1993. Breast cancer screening with mammography: Overview of Swedish randomised trials. Lancet 341(8851):973–978.

Palmer RH, Hargraves JL. 1996. The ambulatory care medical audit demonstration project. Research design. Medical Care 34(9 Suppl):SS12–SS28.

Pankow JS, Vachon CM, Kuni CC, King RA, Arnett DK, Grabrick DM, Rich SS, Anderson VE, Sellers TA. 1997. Genetic analysis of mammographic breast density in adult women: Evidence of a gene effect. Journal of the National Cancer Institute 89(8):549–556.

Perry NM. 2003. Interpretive skills in the National Health Service Breast Screening Programme: Performance indicators and remedial measures. Seminars in Breast Disease 6(3):108–113.

Perry NM. 2004a (September 2). Mammography Quality and Performance in the National Health Service Breast Screening Programme. Presentation at the meeting of the Institute of Medicine Committee on Improving Mammography Quality Standards, Washington, DC.

Perry NM. 2004b. Breast cancer screening—the European experience. International Journal of Fertility & Women’s Medicine 49(5):228–230.

Persson I, Thurfjell E, Holmberg L. 1997. Effect of estrogen and estrogen-progestin replacement regimens on mammographic breast parenchymal density. Journal of Clinical Oncology 15(10):3201–3207.

Physician Insurers Association of America. 2002. Breast cancer study. 3rd ed. Rockville, MD: Physician Insurers Association of America.

Pisano ED, Yankaskas BC, Ghate SV, Plankey MW, Morgan JT. 1995. Patient compliance in mobile screening mammography. Academic Radiology 2(12):1067–1072.

Poplack SP, Tosteson AN, Grove MR, Wells WA, Carney PA. 2000. Mammography in 53,803 women from the New Hampshire Mammography Network. Radiology 217(3):832–840.

Porter PL, El-Bastawissi AY, Mandelson MT, Lin MG, Khalid N, Watney EA, Cousens L, White D, Taplin S, White E. 1999. Breast tumor characteristics as predictors of mammographic detection: Comparison of interval- and screen-detected cancers. Journal of the National Cancer Institute 91(23):2020–2028.

Rabinowitz B. 2000. Psychologic issues, practitioners’ interventions, and the relationship of both to an interdisciplinary breast center team. Surgical Oncology Clinics of North America 9(2):347–365.

Rabinowitz B. 2004. Interdisciplinary breast cancer care: Declaring and improving the standard. Oncology (Huntington) 18(10):1263–1268.

Raza S, Rosen MP, Chorny K, Mehta TS, Hulka CA, Baum JK. 2001. Patient expectations and costs of immediate reporting of screening mammography: Talk isn’t cheap. American Journal of Roentgenology 177(3):579–583.

Records SF. 1995. Female breast cancer is most prevalent cause of malpractice claims. Journal of the Oklahoma State Medical Association 88(7):311–312.

Ries LAG, Miller BA, Hankey BF. 1994. SEER cancer statistics review, 1973–1991. Bethesda, MD: National Cancer Institute.

Roberts MM, Alexander FE, Anderson TJ, Chetty U, Donnan PT, Forrest P, Hepburn W, Huggins A, Kirkpatrick AE, Lamb J. 1990. Edinburgh trial of screening for breast cancer: Mortality at seven years. Lancet 335(8684):241–246.

Robertson MK, Umble KE, Cervero RM. 2003. Impact studies in continuing education for health professions: Update. Journal of Continuing Education in the Health Professions 23(3):146–156.

Roblin DW. 1996. Applications of physician profiling in the management of primary care panels. Journal of Ambulatory Care Management 19(2):59–74.

Ross G, Johnson D, Castronova F. 2000. Physician profiling decreases inpatient length of stay even with aggressive quality management. American Journal of Medical Quality 15(6):233–240.

Rutter CM, Taplin S. 2000. Assessing mammographers’ accuracy. A comparison of clinical and test performance. Journal of Clinical Epidemiology 53(5):443–450.

Saftlas AF, Hoover RN, Brinton LA, Szklo M, Olson DR, Salane M, Wolfe JN. 1991. Mammographic densities and risk of breast cancer. Cancer 67(11):2833–2838.

Scheiden R, Sand J, Tanous AM, Capesius C, Wagener C, Wagnon MC, Knolle U, Faverly D. 2001. Consequences of a national mammography screening program on diagnostic procedures and tumor sizes in breast cancer. A retrospective study of 1540 cases diagnosed and histologically confirmed between 1995 and 1997. Pathology, Research & Practice 197(7):467–474.

Schwartz LM, Woloshin S, Sox HC, Fischhoff B, Welch HG. 2000. U.S. women’s attitudes to false-positive mammography results and detection of ductal carcinoma in situ: Cross-sectional survey. Western Journal of Medicine 173(5):307–312.

Shapiro S, Venet W, Strax P, Venet L. 1988. Periodic Screening for Breast Cancer: The Health Insurance Plan Project and its Sequelae, 1963–1986. Baltimore, MD: Johns Hopkins University Press.

Sickles EA. 1992. Quality assurance. How to audit your own mammography practice. Radiologic Clinics of North America 30(1):265–275.

Sickles EA. 1995a. How to conduct an audit. In: Kopans DB, ed. Categorical Course in Breast Imaging. Oak Brook, IL: Radiological Society of North America. Pp. 81–91.

Sickles EA. 1995b. Latent image fading in screen-film mammography: Lack of clinical relevance for batch-processed films. Radiology 194(2):389–392.

Sickles EA. 2003. The American College of Radiology’s Mammography Interpretive Skills Assessment (MISA) examination. Seminars in Breast Disease 6(3):133–139.

Sickles EA, Miglioretti DL, Ballard-Barbash R, Geller BM, Leung JW, Rosenberg RD, Smith-Bindman R, Yankaskas BC. In press. Performance benchmarks for diagnostic mammography. Radiology.

Sickles EA, Weber WN, Galvin HB, Ominsky SH, Sollitto RA. 1986. Mammographic screening: How to operate successfully at low cost. Radiology 160(1):95–97.

Sickles EA, Wolverton DE, Dee KE. 2002. Performance parameters for screening and diagnostic mammography: Specialist and general radiologists. Radiology 224(3):861–869.

Silverstein MJ. 1973. The multidisciplinary breast clinic—a new approach. UCLA Cancer Bulletin 1:5.

Silverstein MJ. 2000. State-of-the-art breast units—a possibility or a fantasy? A comment from the U.S. European Journal of Cancer 36(18):2283–2285.

Smith RA, Cokkinides V, Eyre HJ. 2005. American Cancer Society guidelines for the early detection of cancer, 2005. CA: A Cancer Journal for Clinicians 55(1):31–44.

Smith RA, D’Orsi C. 2004. Screening for breast cancer. In: Harris JR, Lippman ME, Morrow M, Osborne CK, eds. Diseases of the Breast. New York: Lippincott Williams & Wilkins. Pp. 103–130.

Smith-Bindman R, Chu P, Miglioretti D, Quale C, Rosenberg RD, Cutter G, Geller B, Bacchetti P, Sickles EA, Kerlikowske K. 2005. Physician predictors of mammographic accuracy. Journal of the National Cancer Institute 97(5):358–367.

Smith-Bindman R, Chu PW, Miglioretti DL, Sickles EA, Blanks R, Ballard-Barbash R, Bobo JK, Lee NC, Wallis MG, Patnick J, Kerlikowske K. 2003. Comparison of screening mammography in the United States and the United Kingdom. JAMA 290(16):2129–2137.

Spoeri RK, Ullman R. 1997. Measuring and reporting managed care performance: Lessons learned and new initiatives. Annals of Internal Medicine 127(8 Pt 2):726–732.

Steinberg KK, Thacker SB, Smith SJ, Stroup DF, Zack MM, Flanders WD, Berkelman RL. 1991. A meta-analysis of the effect of estrogen replacement therapy on the risk of breast cancer. JAMA 265(15):1985–1990.

Sung HY, Kearney KA, Miller M, Kinney W, Sawaya GF, Hiatt RA. 2000. Papanicolaou smear history and diagnosis of invasive cervical carcinoma among members of a large prepaid health plan. Cancer 88(10):2283–2289.

Tabar L, Fagerberg G, Duffy SW, Day NE, Gad A, Grontoft O. 1992. Update of the Swedish two-county program of mammographic screening for breast cancer. Radiologic Clinics of North America 30(1):187–210.

Taplin SH, Ichikawa LE, Kerlikowske K, Ernster VL, Rosenberg RD, Yankaskas BC, Carney PA, Geller BM, Urban N, Dignan MB, Barlow WE, Ballard-Barbash R, Sickles EA. 2002. Concordance of Breast Imaging Reporting and Data System (BI-RADS) assessments and management recommendations in screening mammography. Radiology 222(2):529–535.

Taplin SH, Rutter CM, Lehman C. Submitted. Testing the effect of computer assisted detection upon interpretive performance in screening mammography.

Theberge I, Hebert-Croteau N, Langlois A, Major D, Brisson J. 2005. Volume of screening mammography and performance in the Quebec population-based Breast Cancer Screening Program. CMAJ Canadian Medical Association Journal 172(2):195–199.

Thomson-O’Brien MA, Oxman AD, Davis DA, Haynes RB, Freemantle N, Harvey EL. 2004. Audit and feedback versus alternative strategies: Effects on professional practice and health care outcomes. [Review]. Cochrane Database of Systematic Reviews (2):CD000260.

Thurfjell EL, Lernevall KA, Taube AA. 1994. Benefit of independent double reading in a population-based mammography screening program. Radiology 191(1):241–244.

Tosteson AN, Begg CB. 1988. A general regression methodology for ROC curve estimation. Medical Decision Making 8(3):204–215.

Tripathy D. 2004. Interdisciplinary breast cancer care: Declaring and improving the standard. [Review]. Oncology (Huntington) 18(10):1270–1275.

U.S. Preventive Services Task Force. 2002. Screening for breast cancer: Recommendations and rationale. Annals of Internal Medicine 137(5 Pt 1):344–346.

van der Horst F, Hendriks JHCL, Rijken HJTM, Holland R. 2003. Breast cancer screening in the Netherlands: Audit and training of radiologists. Seminars in Breast Disease 6(3):114–122.

van Landeghem P, Bleyen L, De Backer G. 2002. Age-specific accuracy of initial versus subsequent mammography screening: Results from the Ghent Breast Cancer-Screening Programme. European Journal of Cancer Prevention 11(2):147–151.

Veterans Health Administration. 2004. Quality Management (QM) and Patient Safety Activities that Can Generate Confidential Documents. Department of Veterans Affairs, VHA Directive 2004–051. Washington, DC: Veterans Health Administration.

Wang SJ, Middleton B, Prosser LA, Bardon CG, Spurr CD, Carchidi PJ, Kittler AF, Goldszer RC, Fairchild DG, Sussman AJ, Kuperman GJ, Bates DW. 2003. A cost-benefit analysis of electronic medical records in primary care. American Journal of Medicine 114(5):397–403.

Warren-Burhenne L. 2003. Screening Mammography Program of British Columbia standardized test for screening radiologists. Seminars in Breast Disease 6(3):140–147.

Warren-Burhenne LJ, Wood SA, D’Orsi CJ, Feig SA, Kopans DB, O’Shaughnessy KF, Sickles EA, Tabar L, Vyborny CJ, Castellino RA. 2000. Potential contribution of computer-aided detection to the sensitivity of screening mammography. Radiology 215(2):554–562.

Waynant RW, Chakrabarti K, Kaczmarek RA, Dagenais I. 1999. Testing optimum viewing conditions for mammographic image displays. Journal of Digital Imaging 12(2 Suppl 1):209–210.

Weiner JP, Parente ST, Garnick DW, Fowles J, Lawthers AG, Palmer RH. 1995. Variation in office-based quality. A claims-based profile of care provided to Medicare patients with diabetes. JAMA 273(19):1503–1508.

Weiss KB, Wagner R. 2000. Performance measurement through audit, feedback, and profiling as tools for improving clinical care. Chest 118(2 Suppl):53S–58S.

White E, Miglioretti DL, Yankaskas BC, Geller BM, Rosenberg RD, Kerlikowske K, Saba L, Vacek PM, Carney PA, Buist DS, Oestreicher N, Barlow W, Ballard-Barbash R, Taplin SH. 2004. Biennial versus annual mammography and the risk of late-stage breast cancer. Journal of the National Cancer Institute 96(24):1832–1839.

Wolk RB. 1992. Hidden costs of mobile mammography: Is subsidization necessary? American Journal of Roentgenology 158(6):1243–1245.

Wooding D. 2003. PERsonal perFORmance in Mammographic Screening. [Online]. Available: http://ibs.derby.ac.uk/performs/index.shtml [accessed May 12, 2004].

Yankaskas BC, Cleveland RJ, Schell MJ, Kozar R. 2001. Association of recall rates with sensitivity and positive predictive values of screening mammography. American Journal of Roentgenology 177(3):543–549.

Yankaskas BC, Klabunde CN, Ancelle-Park R, Renner G, Wang H, Fracheboud J, Pou G, Bulliard JL. 2004. International comparison of performance measures for screening mammography: Can it be done? Journal of Medical Screening 11(4):187–193.

Yankaskas BC, Taplin SH, Ichikawa L, Geller BM, Rosenberg RD, Carney PA, Kerlikowske K, Ballard-Barbash R, Cutter GR, Barlow WE. 2005. Association between mammography timing and measures of screening performance in the United States. Radiology 234(2):363–373.

Zapka JG, Taplin SH, Solberg LI, Manos MM. 2003. A framework for improving the quality of cancer care: The case of breast and cervical cancer screening. Cancer Epidemiology, Biomarkers & Prevention 12(1):4–13.

Zheng Y, Barlow W, Cutter G. 2005. Assessing accuracy of mammography in the presence of verification bias and intrareader correlation. Biometrics 61(1):259–268.