7
Assessing Operational Suitability

Fielding operationally suitable systems is a prime objective of defense acquisition. A suitable weapon system is one that is available for combat when needed, is reliable enough to accomplish its mission, operates satisfactorily with service personnel and other systems, and does not impose an undue logistics burden in peacetime or wartime.1 As noted above, operational test and evaluation is statutorily required to assess the effectiveness and suitability of defense systems under consideration for procurement.

Scarce resources, increasing technological complexity, and increasing attention to the life-cycle costs of defense systems underscore the need for assurance of suitability and its elements. Experience in the Department of Defense (DoD), similar to that of private industry, shows that the life-cycle maintenance cost of a major system may substantially exceed its original acquisition cost. For example, the total procurement cost of the Longbow Apache helicopter is estimated at $5.3 billion, which is slightly more than one-third of the total estimated 20-year life-cycle cost of $14.3 billion.2

1  

This informal definition is borrowed from Bridgman and Glass (1992:1). For purposes of test and evaluation, operational suitability is defined officially in DoD Instruction 5000.2 as "the degree to which a system can be placed satisfactorily in field use with consideration given to availability, compatibility, transportability, interoperability, reliability, wartime usage rates, maintainability, safety, human factors, manpower supportability, logistics supportability, natural environmental effects and impacts, documentation, and training requirements."

2  

These estimates, in constant fiscal 1994 dollars, are provided in Annex D of the Longbow Apache Test and Evaluation Master Plan, which cites December 1993 estimates from the Longbow Program Office and the President's fiscal 1995 budget as the original sources.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.



Suitability deficiencies have been responsible for many of the field problems with newly acquired systems and have generated concerns about the operational readiness of certain military capabilities. Concern about the department's success in fielding suitable systems was expressed in an October 1990 memorandum from the Deputy Undersecretary of Defense for Acquisition. A 1992 Logistics Management Institute review of seven recently fielded systems found that "several systems have not achieved their availability goals, and they consume significantly more logistics resources than anticipated" (Bridgman and Glass, 1992:ii). That study also found that crucial suitability issues are not adequately identified early, or addressed in operational test plans. Such concerns and findings have led to calls for improved assessment of operational suitability.

In this chapter, we discuss statistical issues related to the conduct of operational suitability tests and their evaluations and related information-gathering activities.

SUITABILITY, TESTING AND EVALUATION, AND STATISTICS

In considering the role of statistical methods and principles in the assessment of operational suitability, it is important to establish first the context of suitability assessment and its relationship to operational test and evaluation. The judgment that a system is "suitable" implies confirmation that the system's use can be adequately supported in its anticipated environments. A comprehensive study of a system's suitability should include, for example, considerations such as its compatibility with other systems in current use, transportability, supportability (in terms of logistics and manpower), reliability, availability, and maintainability. A definitive suitability assessment addresses all of these matters, reporting both on the outcomes of formal tests and on informal observations and judgments made by those charged with the system's evaluation.

The suitability of a system should be assessed from early development through fielding, even though not all elements of suitability can be demonstrated at every stage. The reliability of key equipment, especially critical components, can be demonstrated relatively early in a system's development, but some questions about logistics supportability can only be definitively answered after a system is fielded. Because the assessment of suitability after an operational test is limited by the scope of that testing, it may be desirable to augment the test results with modeling and simulation, test and field data from other sources, and related analyses.

The challenges faced in designing and interpreting results of operational tests underscore the need for, and the potential value of, applying sound statistical methodology. For virtually every aspect of operational suitability, fundamental statistical questions arise concerning the duration and conditions of testing, the measurement and processing of suitability data, and the use of information from other sources, such as developmental tests and simulation models, in the design and subsequent analysis of operational tests. Possible test augmentation procedures involve such advanced statistical tools as hierarchical models, reliability growth models, and accelerated life testing.

Difficulties frequently encountered in suitability assessment can be overcome through a combination of: (1) tests dedicated to providing information on operational suitability; (2) other sources of information about the suitability of the system (e.g., developmental tests, data from similar systems); and (3) statistical analysis and modeling to integrate all relevant sources of reliability, availability, and maintainability (RAM) information about the system in question. Such activities are currently undertaken in the defense testing community with varying degrees of frequency and effectiveness.

Although the panel has gathered information on, and has had substantial exposure to, many different aspects of the military services' practices in assessing the suitability of a potential defense procurement, we have restricted our formal review to the elements of operational suitability in which the statistical issues appear most prominently: reliability, availability, and maintainability.

RELIABILITY, AVAILABILITY, AND MAINTAINABILITY

Basic Definitions

Reliability is the probability that a system (machine, device, etc.) will perform its intended function under appropriate operating conditions for a specified period of time.3 Clearly, component reliability is not the same as system reliability. The various services have their own definitions of reliability that are similar to the one presented here (see, e.g., U.S. Department of Defense, 1994b:7-1).

Availability has been defined in many different ways, for example: "a measure of the degree to which an item is in an operable state at the start of a mission when the mission is called for at a random point in time," and "the proportion of time a system is either operating or is capable of operating when used in a specific manner in a typical maintenance and supply environment" (U.S. Department of Defense, 1994b:7-3).

Maintainability has been defined to be "a measure of the ability of an item to be retained in, or restored to, a specified condition when maintenance is performed by personnel having specified skill levels, using prescribed procedures and resources" (U.S. Department of Defense, 1994b:7-2).

Each definition requires its own set of methods in order to draw appropriate and statistically supportable conclusions about the corresponding numerical value.

3

The term time is used in a generic sense in this chapter and can refer to other measurements, such as mileage or cycles of operation.
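
The reliability definition above can be written compactly in symbols. The notation here is an assumption introduced for illustration and does not appear in the original text: X denotes the random time to failure of the system ("time" in the generic sense of footnote 3), R(t) its reliability over a period of length t, and MTTF its mean time to failure.

```latex
R(t) = \Pr(X > t), \qquad
\mathrm{MTTF} = \mathbb{E}[X] = \int_0^{\infty} R(t)\,dt .
```

The second identity holds for any nonnegative lifetime X with finite mean, so it does not depend on a particular failure-time model.
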

Statistical and Broader Methodological Issues

RAM assessment is among the most statistically intensive subjects of investigation in the entire defense acquisition process. Six methodological issues underlie our concerns about both the design and the evaluation of tests that serve as sources of operational RAM data and the basis for reaching conclusions about RAM: (1) the relative prominence of effectiveness, rather than suitability (particularly, RAM) considerations in designing operational tests; (2) the applicability of commercial industrial practices and standards concerning system reliability, availability, and maintainability; (3) the questionable validity of some statistical assumptions commonly made in determining test resource requirements; (4) the subjective scoring of "mission critical" failures in processing data for RAM evaluation; (5) current and potential uses of multiple sources of RAM information in the design of operational tests and in the analysis of test results, including the collection and use of field data for continuing RAM assessment after a system is acquired; and (6) current and potential uses of special statistical methods for (a) assessing reliability changes over time and (b) using accelerated stress testing methods for assessing reliability.

The first two issues relate to key concepts, processes, and organizational structures that affect how RAM assessment can better realize potential gains from improvements in statistical practice. The last four issues relate to technical aspects of statistical practice in the design, execution, and analysis of related operational testing. In the remainder of this chapter we assess the current state of related operational testing in the services with respect to these issues and put forward recommendations aimed at strengthening its overall quality, efficiency, and utility.

Considerations in Operational Test Planning

Despite the major impact of suitability problems on a system's life-cycle cost, shortcomings with respect to measures of effectiveness are generally regarded as more "serious" problems for prospective defense systems than are deficiencies with respect to RAM-related measures of suitability. System effectiveness is a sine qua non: without its demonstration, further development and procurement of a prospective system are insupportable. Yet a lack of equally rigorous RAM testing and evaluation can quickly lead to unacceptable field performance of the system.

Plans to assess operational suitability in support of a procurement decision must consider two prominent questions: (1) How can one be sure the key RAM issues have been identified? (2) How does one make specific decisions with a reasonable degree of confidence about these issues, especially those concerning system availability and maintainability? Answering these questions is particularly important in view of deficiencies in recent field experiences with newly acquired systems. Such deficiencies raise concerns not only about military readiness, but also about opportunity costs. Unexpectedly high costs in maintaining a system after procurement reduce the resources available for additional military capability or further system development. The assessment of suitability, and of RAM in particular, deserves increased emphasis at all stages of a system's life cycle, including operational test and evaluation.

Recommendation 7.1: The Department of Defense and the military services should give increased attention to their reliability, availability, and maintainability data collection and analysis procedures because deficiencies continue to be responsible for many of the current field problems and concerns about military readiness.

A primary objective of such increased attention should be to improve the identification of key RAM parameters and the specification of associated requirements. Key parameters are those that could cause unacceptable system performance or cost, either because they consume significant logistics resources or because they pose risks associated with major changes in system technology, operational concept, or support concept (Bridgman and Glass, 1992). In accordance with federal statutes and DoD policy directives, test planners must decide how each RAM measure will be addressed, and the Director, Operational Test and Evaluation (DOT&E) must render an opinion as to whether that treatment is adequate.
However, our review of case studies suggests that operational test planning does not always account explicitly for the type and quantity of information needed to address key RAM issues. For example, the collection of RAM data often depends heavily on how an operational test unfolds, in terms of "casualties," actions of the soldiers involved, the length of battles, and similar characteristics. Such a dynamic mode of data collection creates enormous statistical difficulties when analyzed as a sequential experiment. In such cases, particularly in force-on-force tests, there is little hope of predicting accurately the amount of RAM data that will be obtained during the operational test event and, consequently, little opportunity to use statistical methods in planning the test to meet RAM information requirements.

A common approach in operational testing seems to be to conduct the testing necessary to assess effectiveness and to accumulate the requisite amount of RAM data in a resourceful manner. The Longbow Apache test appears to be typical in that the size of the force-on-force test was primarily determined by considerations of effectiveness, rather than those of suitability. The primary data sources for measuring the reliability, availability, and maintainability performance of the Longbow system were the gunnery and force-on-force tests conducted as part of the initial operational test and evaluation. Secondary data sources included developmental testing, logistical demonstrations, and force development test and experimentation. According to the test and evaluation plan, "Secondary data sources will be used to supplement [operational test] results by demonstrating historical performance, helping identify trends in anomalies found in the primary MOPs [measures of performance], and characterizing performance in conditions not encountered in the [initial operational test]" (Hall et al., 1994:2-92). The test and evaluation plan also stated that, because of the different conditions that obtain under gunnery and force-on-force testing, results from the two phases would not be merged but, instead, would be presented separately in the evaluation report. For the RAM-related measures of performance, however, the operating hours appear to reflect the accumulated hours under both the gunnery and force-on-force phases.

While we acknowledge the inherently uncontrollable aspects of some forms of operational testing, such as force-on-force trials, and the incompleteness of the logistics support structure prior to fielding, we believe that more statistically designed RAM testing of prospective systems can and should be performed. Formal statistical approaches in designing these events, which can be called "operational suitability tests," will ensure more rigorous assessment of key RAM issues. Furthermore, statistical test design concepts provide a conceptual structure within which costs and benefits of data collection can be weighed and test resources can be allocated optimally. As illustrated in the C-130H program (see Appendix B), RAM considerations can drive the scope of operational test events, particularly in determining the number of units to be tested and the total time on test.
In such cases, improvements in the application of statistical methods in test design could lead to more efficient use of resources.

Recommendation 7.2: The Test and Evaluation Master Plan and associated documents should include more explicit discussion of how reliability, availability, and maintainability issues will be addressed, particularly the extent to which operational testing, modeling, simulation, and expert judgment will be used. The military services should make greater use of statistically designed tests to assess reliability, availability, and maintainability and related measures of operational suitability.

The effective statistical design of operational suitability tests requires, as a preliminary step, that the rationale behind each RAM-related requirement be adequately documented. In this way test designers can specify reasonable requirements for determining the test sample sizes needed to attain a required precision. The need for underlying documentation, rationale, and support is especially crucial for RAM measures that may significantly affect field cost and performance. If the 20-year life-cycle costs of the Longbow Apache are more than twice the cost of its procurement, even a modest amount of imprecision in estimating system availability might have significant consequences when estimating the system's life-cycle costs.

Historically, numerical operational requirements for RAM measures are usually specified and sometimes extensively documented in RAM rationale reports, but justification is rarely given for particular numerical goals. Reference is sometimes made to the RAM requirements or performance of a baseline system against which a new system is being evaluated; rarely, though, is there an explanation of the relationship between a RAM goal and the system's performance or cost of maintenance. Also missing is any discussion of the implications for performance, and the cost of failing, if the system's demonstrated RAM characteristics fall below the numerical goals.

Experience suggests that the RAM requirements specified in operational test and evaluation are not so rigid that failure to attain a numerical goal necessarily results (or should result) in cancellation of the system. Indeed, such criteria should permit a balancing of a variety of considerations. If operational requirement documents incorporate performance and cost information, decision makers can better determine how this balancing should be done in setting RAM criteria.

Recommendation 7.3: As part of an increased emphasis on assessing the reliability, availability, and maintainability of prospective defense systems, reasonable criteria should be developed for each system. Such criteria should permit a balancing of a variety of considerations and be explicitly linked to estimates of system cost and performance. A discussion of the implications for performance, and cost of failing, if the system's demonstrated reliability, availability, and maintainability characteristics fall below numerical goals should be included.

One effective way to express how attainment of a reliability goal should interact with test size is to specify in the Test and Evaluation Master Plan, for a given sample size, the minimum observable suitability value that could be accepted.

STATISTICAL ASSUMPTIONS IN DETERMINING RAM TEST DESIGN AND RESOURCES

Suppose that a test is designed to assess whether a proposed new component of a system possesses a mean time to failure (MTTF)4 of at least µ0 hours. Moreover, µ0 is significantly larger than the corresponding MTTF of µ1 hours for an existing component that would be replaced by the prospective component if it is acquired. (This example is, of course, a simplification; in practice, a prospective component would be expected to meet several RAM-related requirements. For ease of exposition, we ignore such multivariate aspects.)

4

We are ignoring, for the purposes of this discussion, the difference between mean time between failure and mean time between operational mission failure. This difference relates to the issue of combining developmental and operational test data in that they are distinct concepts.

The fundamental question is determining whether the prospective component system meets its MTTF requirement. This decision problem can be framed as a statistical significance test, where one hypothesis is "the true MTTF of the component under study is equal to µ0" and the second hypothesis is "the true MTTF is µ1, where µ1 < µ0." Both µ1 and µ0 have the property that rejecting a component with reliability greater than µ0 is costly, as is accepting a component with reliability less than µ1. For reliabilities between µ1 and µ0, a decision maker is relatively indifferent to the question of accepting the component.

One key question that arises in this context is how much testing is enough: How many production components must be tested and for how long? To answer this question precisely requires that an analyst make specific assumptions. Foremost of these is the specification of a statistical model for the distribution of observed time to failure of the component. The choice of this statistical model has important consequences for test design and interpretation. We comment below on the difficulties, risks, and missed opportunities attendant to the uncritical use of models and procedures based on questionable statistical assumptions. We also consider the potential for making more reliable inferences when information from various sources is fully exploited to evaluate the aptness of proposed statistical models, and we suggest reasonable alternatives.

Exponential Life Testing: Current Uses and Limitations

One particular statistical model has been quite widely used for RAM testing, with applications far outnumbering those of competing approaches.
The model is that of exponential life testing, in which the distribution of observed times to failure is assumed to be well described by an exponential probability distribution. The popularity of the exponential model is due to its analytical tractability, the clarity with which one can proceed, and empirical and theoretical evidence that some systems (for example, some purely electronic component systems) often have failure times that satisfy this model to a reasonable degree of approximation.5 The first two reasons lead, for many different experimental designs, to an exact analysis that can determine, in advance, the resources required to meet specified bounds or requirements for confidence levels or error probabilities. And given that many non-electronic component systems have non-exponential failure times, tests based on the exponential assumption can be "conservative." This means that the actual producer's risk (the probability of incorrectly concluding that the mean time to failure is less than the requirement) and the actual consumer's risk (the probability of incorrectly concluding that the mean time to failure is more than the requirement) will be smaller than the nominal levels for which the test is planned.

5

However, some electronic components suffer from "infant mortality," that is, a short initial phase in which the failure rate is decreasing, typically due to manufacturing defects.

Department of Defense Handbook H108 describes in detail how the µ0 versus µ1 hypothesis test should be carried out under the assumption of exponentially distributed times to failure. As explained in that handbook and in other references (e.g., U.S. Department of Defense, 1982; Lawless, 1982; Bain and Engelhardt, 1991), one proceeds by specifying desired levels of α (the producer's risk) and β (the consumer's risk), and then identifying the number of observed failures, r, and maximum test time, T, required to resolve the test at the prescribed levels of α and β. The test is then executed by rejecting the hypothesis that "the true MTTF is µ0" if the total time on test T at the time of the r-th failure is less than some prescribed number. An appealing feature of exponential life testing is that it is easy to show that the duration of the test can be shortened by placing a larger number of component systems on test than the critical number r.

The simplicity of tests based on the assumption of exponentially distributed times to failure derives from the fact that the performance of these tests depends on the hypothesized means µ0 and µ1 only through the discrimination ratio, D = µ1/µ0. The required number of observed failures (r) and the required total test time (T) can be computed explicitly given the two error probabilities (α and β) and this discrimination ratio (see U.S. Department of Defense, 1960:Table 2B-5, which provides the required number of observed failures r and the ratio T/µ0 for selected values of D, α, and β). Using such tables, one can design and carry out reliability tests with relative ease.
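
The calculation behind such tables can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not a reproduction of Handbook H108: it assumes a test that ends at the r-th failure, so that under the exponential model the total time on test divided by the true MTTF follows a gamma distribution with shape r, and it searches for the smallest r that meets both risks. The function names (`erlang_cdf`, `erlang_quantile`, `exponential_test_plan`) are hypothetical.

```python
import math

def erlang_cdf(x, r):
    """P(G <= x) for G ~ Gamma(shape=r, scale=1), integer r (Poisson-sum identity)."""
    if x <= 0:
        return 0.0
    log_x = math.log(x)
    # P(G > x) equals P(Poisson(x) <= r - 1); terms are computed in log space
    upper = sum(math.exp(-x + k * log_x - math.lgamma(k + 1)) for k in range(r))
    return max(0.0, 1.0 - upper)

def erlang_quantile(p, r):
    """Invert erlang_cdf by bisection."""
    lo, hi = 0.0, 1.0
    while erlang_cdf(hi, r) < p:
        hi *= 2.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if erlang_cdf(mid, r) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def exponential_test_plan(alpha, beta, d, r_max=500):
    """Return (r, c): the smallest number of failures r and the cutoff
    c = (critical total time on test) / mu0, where d = mu1 / mu0 < 1.

    Decision rule assumed: reject 'MTTF = mu0' when total time on test < c * mu0.
    """
    for r in range(1, r_max + 1):
        c = erlang_quantile(alpha, r)               # producer's risk equals alpha
        consumer_risk = 1.0 - erlang_cdf(c / d, r)  # P(accept | MTTF = mu1)
        if consumer_risk <= beta:
            return r, c
    raise ValueError("no plan found with r <= r_max")

r, c = exponential_test_plan(alpha=0.05, beta=0.05, d=1 / 3)
print(r, round(c, 3))  # 10 5.425
```

Under these assumptions, a discrimination ratio of 1/3 with both risks set at 0.05 calls for 10 observed failures, and the requirement is rejected if the total time on test at the tenth failure is below about 5.4 times µ0. Ratios closer to 1 require many more failures, which is why the choice of µ1 and µ0 drives test resources.
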
However, a number of difficulties arise when the underlying assumption is not met, that is, when failure times do not closely follow an exponential distribution. In cases for which the exponential test design is conservative, the opportunity to carry out a more efficient test or to save test resources is forgone. And conclusions based on exponential assumptions may differ substantially from the conclusions that would be drawn using more plausible statistical assumptions (Zelen and Dannemiller, 1961). It can thus be very important for an analyst to determine when the exponential model is of dubious validity and to use an alternative analysis in such cases.6

6

When assessing whether an exponential model might be appropriate, it is often important to distinguish between different failure modes. For example, "start-up" failures might follow a different model than "operating" failures.

Alternatives to Exponential Life Testing

A key implication of exponentially distributed times to failure is that the conditional probability of experiencing a failure in any small time interval of fixed length, given survival to the starting point of the interval, does not depend on how long the equipment has been operating. In circumstances in which such an assumption of a constant failure rate, or "hazard rate," is plausible (e.g., some electronic components), the exponential distribution model for time to failure provides a reasonable representation of equipment reliability. However, in some applications the hazard rate is relatively high in the early stages of operation due to manufacturing defects, i.e., infant mortality problems. Also, some equipment (for example, mechanical systems) experiences a "wear-out" phase when the hazard rate becomes relatively high because of aging. The hazard rate of such equipment will, when plotted versus time, exhibit a U-shape, the so-called "bathtub" curve.

When the hazard rate is expected to change over time, alternatives to the exponential failure distribution model must be considered. Many alternative models have been discussed in the statistical literature, including the Weibull, lognormal, and gamma distributions. In particular, the Weibull model has an "aging" parameter that can be used to represent increasing or decreasing hazard rates, and it includes the exponential as a special case.

The exponential model is also often inappropriate when the (component) system is repairable. For repairable systems, the observed data can be in the form of either times between failures or numbers of observed failures in different time intervals. The assumption that times between failures are independent and identically distributed according to an exponential distribution, or equivalently a homogeneous Poisson process model for the number of failures, is also often inappropriate. This assumption implies that when a system is repaired, it is restored to "as good as new" condition and the rate with which failures occur over time (failure intensity) is constant.
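
The role of the Weibull "aging" parameter described above can be made concrete with the standard Weibull hazard function. The sketch below is illustrative only; the parameter names `shape` and `scale`, the function name `weibull_hazard`, and the numerical values are assumptions chosen for the example.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1), t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

times = [10.0, 100.0, 1000.0]  # operating hours (hypothetical)
for shape in (0.5, 1.0, 2.0):
    rates = [weibull_hazard(t, shape, scale=500.0) for t in times]
    print(f"shape={shape}: " + ", ".join(f"{h:.5f}" for h in rates))
```

A shape below 1 gives a decreasing hazard (infant mortality), a shape of exactly 1 gives the constant hazard of the exponential model, and a shape above 1 gives an increasing hazard (wear-out). A single Weibull distribution cannot reproduce the full bathtub curve, which combines all three phases.
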
There are many alternatives to this model that have been developed in the statistical literature (see Cox and Lewis, 1972; Ross, 1996; and Ascher and Feingold, 1984). One approach is to use a general renewal process model where the times between failures are still independent and identically distributed but are allowed to be non-exponential (e.g., Weibull, lognormal, gamma, etc.). Other alternatives include modeling the number of observed failures as a mixture of Poisson processes or as a non-homogeneous Poisson process. The former leads to a compound Poisson distribution for the number of failures and is useful in situations where the variance is larger than the mean (overdispersed Poisson). The non-homogeneous Poisson process allows the failure intensity to vary over time and hence can be used to model system degradation or reliability growth. A careful treatment of data from an operational test presumes that the models employed for the number and timing of observed failures are selected with attention to the special features of the application and to the quality of the fit of the model to the available data.

Our general thesis is that it is important to move beyond exponential life testing. We develop one particular alternative in some detail in the following section as an example of the potential benefits (and possible risks) of non-exponential modeling, but as we have already noted, there is an extensive literature on this topic. An analyst can easily find instructions for carrying out exponential life tests (including a host of military documents such as DoD's RAM Primer and numerous military handbooks and military standards), but it is harder to find instructions for alternative analyses within the DoD reliability literature. However, these alternative models have been discussed extensively in the statistical literature in recent years. Unlike the exponential model, for which exact analytical solutions can be easily obtained, these models require approximate, numerical methods for reliability test planning and evaluation. However, this is not a major concern with modern computing resources. There are also tables and software routines available in the literature (e.g., see Escobar and Meeker (1994) for the LINF algorithm and Meeker and Nelson (1976) for tables and figures that can be used to plan Weibull life tests). With some of the more recent statistical software packages, it is an easy matter to fit other distributions such as Weibull, lognormal, and gamma to failure time data (even with complicated censoring patterns, e.g., not knowing when a system was first put on test).

Concern about the overuse (and misuse) of simple models is not new. It has been amply documented in the research literature dealing with defense analysis, as well as in other fields. But certain consequences of the use of inappropriate models are less well understood than others. A question that needs to be explored in some detail, for which we have made a start, is the potential that exists for resource savings when one recognizes in advance that an alternative model is appropriate.
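
One way to explore the consequences of an inappropriate model is by simulation. The sketch below is a hypothetical illustration, not an analysis from the panel: it takes a test rule derived under the exponential model (10 units run to failure, with the requirement rejected when the total time on test falls below 5.4254 times µ0, which gives a nominal producer's risk of 0.05) and estimates the actual producer's risk when lifetimes are Weibull with the same mean. The function name, design, and parameter values are all assumptions.

```python
import math
import random

def simulated_producer_risk(shape, r=10, cutoff=5.4254, mu0=1.0, reps=20000, seed=1):
    """Estimate P(reject 'MTTF = mu0') when lifetimes are Weibull with mean mu0.

    Design assumed here: r units are run to failure without replacement, so the
    total time on test at the r-th failure is the sum of the r lifetimes. The
    default cutoff is the 5th percentile of a Gamma(shape=10, scale=1) variable,
    i.e. the exponential-theory cutoff for r = 10 and alpha = 0.05.
    """
    rng = random.Random(seed)
    scale = mu0 / math.gamma(1.0 + 1.0 / shape)  # makes the Weibull mean equal mu0
    rejections = 0
    for _ in range(reps):
        total_time = sum(rng.weibullvariate(scale, shape) for _ in range(r))
        if total_time < cutoff * mu0:
            rejections += 1
    return rejections / reps

for shape in (0.5, 1.0, 2.0):
    print(f"Weibull shape {shape}: estimated producer's risk "
          f"{simulated_producer_risk(shape):.3f}")
```

With shape 1 (the exponential case) the estimated risk should sit near the nominal 0.05. With an increasing hazard (shape 2) the estimated risk is far below nominal, which is the sense in which the exponential plan is conservative and test resources could be saved. With a decreasing hazard (shape 0.5) it is well above nominal, so conservatism is not guaranteed.
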
An Example of Alternative Statistical Analysis: Weibull Life Testing

The Weibull lifetime model is a common alternative to the exponential time-to-failure model in applications involving non-repairable (or perfectly repairable) systems or components. Since the family of Weibull probability distributions contains the family of exponential distributions as a special case, it represents a generalization of the exponential model rather than a rejection of it. Despite the extensive statistical literature on modeling and inference based on the Weibull distribution (see, e.g., Nelson and Meeker, 1978; Meeker and Nelson, 1976, 1977; Meeker, Escobar, and Hill, 1992), there has been relatively little work that is directly applicable to hypothesis test design and execution. Lawless (1982) notes: "life test plans under the Weibull model have not been thoroughly investigated . . . it is almost always impossible to determine exact small-sample properties or to make effective comparisons of plans . . . further development of test plans under a Weibull model would be useful."

As pointed out above, our examination of the Weibull model should not be taken as an endorsement of Weibull life testing. It should, instead, be taken simply as an example of an alternative approach having the potential for utilizing resources more effectively

A disagreement arose between testers and system designers concerning how to score an event in which the critical workstation fails and operators subsequently switch to one of two non-critical workstations within 3 minutes. One procedure would charge the system with a "critical fault" in order to estimate its reliability, but its availability would be relatively unaffected since the system is recovered within 3 minutes. In another example, initial Marine Corps plans for operational testing of the Advanced Amphibious Assault Vehicle (AAAV) involved force-on-force engagements. In the absence of extensive instrumentation to record events during the test, human observers would be strategically placed to produce a qualitative assessment of the system's performance. Because of concerns about the consistency, validity, and unbiased uses of such scoring, these plans are currently undergoing revision and may ultimately lead to testing at a well-instrumented site. Such instances underscore the need for objective failure definitions and scoring criteria, as well as rigorous documentation of the actual scoring of RAM data and subsequent evaluation. In many cases, precise measurements of failures and repair times observed during testing are "processed" into "rough" estimates of such characteristics as mean time to failure and mean time to repair, which are then combined with assumptions about system operating tempo and logistics support to produce estimates of operational availability. The uncertainty and arbitrariness inherent in this process, and the sensitivity of the final estimates to changes in assumptions, are rarely discussed prominently; consequently, the appearance of mathematical sophistication and precision in reporting the resulting suitability measures may be very misleading. 
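The sensitivity described above can be made concrete with a small calculation. The sketch below uses the common steady-state approximation in which availability equals uptime divided by the sum of uptime and downtime. The function name, the logistics-delay parameter, and every numerical value are hypothetical; they are chosen only to show how alternative scoring decisions and support assumptions can move the final estimate.

```python
def operational_availability(mtbf_hours, mttr_hours, logistics_delay_hours=0.0):
    """Steady-state availability: uptime / (uptime + downtime)."""
    downtime = mttr_hours + logistics_delay_hours
    return mtbf_hours / (mtbf_hours + downtime)

# Hypothetical test record: 400 operating hours, 12 hours of active repair.
operating_hours = 400.0
repair_hours = 12.0

# Two plausible scorings of the same events (for example, whether brief
# recoverable outages are charged as failures) crossed with two assumed
# logistics delays per failure.
scenarios = []
for failures_charged in (4, 8):
    for delay in (0.0, 24.0):
        mtbf = operating_hours / failures_charged
        mttr = repair_hours / failures_charged
        a_o = operational_availability(mtbf, mttr, delay)
        scenarios.append((failures_charged, delay, a_o))
        print(f"failures={failures_charged}, delay={delay:4.1f} h -> Ao={a_o:.3f}")

spread = max(s[2] for s in scenarios) - min(s[2] for s in scenarios)
print(f"spread across scenarios = {spread:.3f}")
```

In this hypothetical case the scoring choice leaves availability unchanged when no logistics delay is assumed but matters a great deal once a delay per failure is included, which is the kind of dependence on assumptions that a report would need to state explicitly.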
Recommendation 7.6: Service test agencies should carefully document, in advance of operational testing, the failure definitions and criteria to be used in scoring reliability, availability, and maintainability data. The objectivity of the scoring procedures that were actually implemented should be assessed and included in the reporting of results. The sensitivity of final reliability, availability, and maintainability estimates to plausible alternative interpretations of test data, as well as subsequent assumptions concerning operating tempo and logistics support, should be discussed in the reporting.

USE OF AUXILIARY INFORMATION IN RAM TEST DESIGN AND EVALUATION

A general and recurring theme in this report concerns the potential value of making better use of all information that is available and relevant to the evaluation of a prospective military system. In current RAM-specific practices, the test and evaluation process does not fully exploit opportunities for using information from disparate sources to obtain more reliable inferences (Bridgman and Glass, 1992). Cultural, technical, and organizational barriers may prevent operational testing analysts from using data from developmental testing or other sources to help make inferences about system suitability.

Such barriers may result in part from the congressional language and concerns reflected in Title 10, Sec. 2399, U.S. Code, which discusses the operational test and evaluation of defense acquisition programs. Raising this issue, Bridgman and Glass (1992:4) note:

"This language discourages the use of modeling in lieu of test and speaks of the suitability of the item tested, not the suitability of the complete system to be fielded. Its intent is to prevent suitability judgments from being made on promised rather than actual performance. However, it should not be extended, in our judgment, to preclude the use of all other sources of data or of modeling to augment test data and help evaluate the results. OT&E [operational testing and evaluation] is specifically authorized by law to have access to such information."

We share the view that such uses of auxiliary information should be permissible (see Recommendation 4.3). Current practices, while understandably intended to maintain the independence of operational test and evaluation, have had the negative consequence of forcing RAM analysts either to live with higher than desirable variability or to incur substantial additional expense in executing the large numbers of replications needed to obtain the precision required to support production decisions.

Technical barriers exist because operational test personnel may not have been exposed to available methods as part of their statistical training.
A number of established statistical approaches allow combining information from different sources (e.g., meta-analysis, hierarchical Bayes, and empirical Bayes methods), but their successful application typically requires the involvement of personnel with advanced training in statistics (see Chapter 10). Such personnel may not be routinely available to participate as team members in suitability assessments.

We present here an illustration of the potential value of using statistical approaches to integrate information from multiple sources in assessing RAM performance. Table 7-1 shows failure data for air conditioning equipment on 13 Boeing 720 aircraft (Proschan, 1963; Gaver and O'Muircheartaigh, 1987). The observed hazard rates vary considerably, from about 3 to almost 17 failures per 1,000 hours. The number of observed failures in the test varies from aircraft to aircraft for two reasons: intrinsic reliability differences between the aircraft, and random error due to the fact that the number of failures would differ from test to test even if the intrinsic reliability remained constant.

If the 13 aircraft were homogeneous, with a common hazard rate of 10 failures per 1,000 hours, ordinary random variability would result in a range of observed hazard rates from 5 to 14. Since the observed range is only slightly greater than this, it makes sense to "shrink" these hazard rates toward a common mean rate.

TABLE 7-1 Failures of Air Conditioning Equipment on 13 Boeing 720 Aircraft

Aircraft ID   Failures   Hours (in thousands)   Raw Failure Rate (failures/1,000 hr)
11             2         0.623                   3.21
9              9         1.800                   5.00
5             14         1.832                   7.64
4             15         1.819                   8.25
12            12         1.297                   9.25
10             6         0.639                   9.39
2             23         2.201                  10.45
3             29         2.422                  11.97
1              6         0.493                  12.17
13            16         1.312                  12.20
7             27         2.074                  13.02
8             24         1.539                  15.59
6             30         1.788                  16.78

Incorporating the information available from all the aircraft and using the methods discussed in Gaver and O'Muircheartaigh (1987), the hazard rates for Aircraft #11 and #6 (those with the minimum and maximum observed hazard rates, respectively) can be reestimated as approximately 8.5 and 13.5 failures per 1,000 hours.

The crucial assumption used in shrinking the observed hazard rates toward the mean hazard rate is that these 13 aircraft are essentially indistinguishable from one another with regard to the process that produced their air conditioning systems. If individual aircraft had air conditioning systems of slightly different designs, or if data had been gathered from aircraft under different operating conditions, then alternative assumptions and resulting estimators might be appropriate.

It has been demonstrated both theoretically and empirically that using a shrinkage estimator in situations like this can yield marked improvements in predictive accuracy. Such established statistical methods for combining RAM data from multiple sources, when correctly applied, lead to more effective policies and decision making. For example, Hoadley (1981) showed that Bell Telephone quality assurance decisions can be made more accurately if empirical Bayes procedures are used. Morris (1983) discusses several other successful applications to important problems:

"The most persuasive arguments for the effectiveness of the procedures actually were made via cross-validatory methods. The shrinkage techniques predicted new data in each application more accurately than did the traditional methods, and the investigators showed that better decisions would have been made had empirical Bayes procedures been used in the past."
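The shrinkage idea applied to Table 7-1 can be sketched in a few lines of Python. The version below is not the procedure of Gaver and O'Muircheartaigh (1987); it is a simpler gamma-Poisson empirical Bayes calculation in which the prior is fitted by a method-of-moments step, which is an assumption made here for brevity. The function and variable names are likewise illustrative, and the resulting values need not agree exactly with those quoted above.

```python
# Table 7-1 data: (aircraft ID, failures, operating hours in thousands).
DATA = [
    (11, 2, 0.623), (9, 9, 1.800), (5, 14, 1.832), (4, 15, 1.819),
    (12, 12, 1.297), (10, 6, 0.639), (2, 23, 2.201), (3, 29, 2.422),
    (1, 6, 0.493), (13, 16, 1.312), (7, 27, 2.074), (8, 24, 1.539),
    (6, 30, 1.788),
]

def shrink_rates(data):
    """Gamma-Poisson empirical Bayes estimates with a moment-based prior fit."""
    k = len(data)
    total_hours = sum(t for _, _, t in data)
    pooled = sum(x for _, x, _ in data) / total_hours      # pooled failure rate
    # Exposure-weighted spread of the raw rates about the pooled rate.
    spread = sum(t * (x / t - pooled) ** 2 for _, x, t in data)
    # Remove the part of the spread expected from Poisson noise alone.
    denom = total_hours - sum(t * t for _, _, t in data) / total_hours
    prior_var = (spread - (k - 1) * pooled) / denom
    if prior_var <= 0:
        # No excess variation: every aircraft gets the pooled rate.
        return pooled, {ident: pooled for ident, _, _ in data}
    prior_hours = pooled / prior_var          # gamma prior rate parameter
    prior_failures = pooled * prior_hours     # gamma prior shape parameter
    estimates = {
        ident: (prior_failures + x) / (prior_hours + t)    # posterior mean
        for ident, x, t in data
    }
    return pooled, estimates

pooled, estimates = shrink_rates(DATA)
print(f"pooled rate = {pooled:.2f} failures per 1,000 hours")
for ident, x, t in DATA:
    print(f"aircraft {ident:2d}: raw {x / t:6.2f} -> shrunk {estimates[ident]:6.2f}")
```

Each posterior mean is a weighted average of the aircraft's raw rate and the pooled rate, with aircraft that have fewer operating hours pulled more strongly toward the pooled value.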

In the later stages of developmental testing, when the process of developing a reliable system prototype has become relatively stable, there will be experimental data that can be both relevant and quite useful to the operational tester. The use of that data in combination with the data collected in designed operational tests may offer some important benefits, including the possibility of an early resolution regarding a system's suitability. There should be greater openness to the selective (and supported) use of statistical methods for combining data from developmental and operational testing, as well as other relevant information, including subjective inputs from scientists with appropriate expertise and commercial or industrial data on related components or systems.

Recommendation 7.7: Methods of combining reliability, availability, and maintainability data from disparate sources should be carefully studied and selectively adopted in the testing processes associated with the Department of Defense acquisition programs. In particular, authorization should be given to operational testers to combine reliability, availability, and maintainability data from developmental and operational testing as appropriate, with the proviso that analyses in which this is done be carefully justified and defended in detail.

Organizational barriers exist to effective sharing of information about RAM performance. Two institutional problems contributing to poor data utilization within DoD have been identified (Jim Hodges, in Rolph and Steffey, 1994:44): "First, no one has responsibility for accumulating test data in one place and ensuring proper utilization. Second, there are competing objectives among the chief players—defense contractors, the services, and the Office of the Secretary of Defense—that complicates the collection and interchange of data."
A recent RAND study of data quality problems in Army logistics (Galway and Hanks, 1996) pointed out that data quality problems are particularly acute when data are collected in one organization for use by another: The costs to collect data and to ensure quality (e.g., detailed edit checks at the point of entry) are often very visible to the collecting entity in terms of time and energy expended. The benefits may be very diffuse, however, particularly in a large organization . . . where data collected in one place may be analyzed and used in very distant parts of the organization with different responsibilities and perspectives. In these cases, one part of an organization may be asked or required to collect data that have little immediate effect on its own operations but that can be used by other parts of the organization to make decisions with long-term impacts. Intraorganizational incentives and feedback to insure data quality in these cases have been difficult to devise. These authors propose a three-level framework for understanding and classifying the nature of data problems:

Operational data problems are present when data values are missing, invalid, or inaccurate.
Conceptual data problems are present when the data, because of imprecision or ambiguities in their definition, are not suitable for an intended use or, because of definitional problems, have been subjected to varying collection practices, again resulting in missing, invalid, inaccurate, or unreliable values.
Organizational data problems occur when there are disconnects between the various organizations that generate and use data, resulting in a lack of agreement on how to define and maintain data quality. One symptom of organizational problems is the persistence of operational and conceptual problems over time, even after repeated attempts at solution.

Changes during the process of RAM data collection (in such conditions as system configuration or test environment) are not always easily retrievable from existing databases. Such a conceptual data problem would critically interfere with the development of defensible methods for combining information from multiple test events.

A related complication concerns the utility of test documentation. As discussed in Chapter 6, operational test reports often put forth raw estimates of parameters of interest, with no indication of the amount of uncertainty involved. When such reports are combined with other evidence of a system's performance (regardless of the quality of the additional information), the decision maker cannot describe in a definitive or statistically supportable way the risks associated with the acquisition of the system.

In the military services, according to Galway and Hanks (1996): "Data seem insignificant compared to the physical assets of equipment, personnel, and materiel. However, data are also assets: they have real value when they are used to support critical decisions, and they cost real money to collect, store, and transmit."
Improving the capabilities for archiving and retrieval of data and documentation (see Chapter 3) has potential benefits that are particularly great for RAM evaluation. RAM data on a prospective system taken from earlier stages of development, and from similar systems, can significantly improve the accuracy of conclusions drawn from operational testing and can reduce the amount of resources required for such testing. Therefore, efforts should be made to archive early RAM performance data for use in assessing operational suitability. Furthermore, the maintenance of a database that permits continuing RAM performance evaluation after the system is fielded is consistent with best industrial practices for quality management and product improvement (see below).

Recommendation 7.8: All service-approved reliability, availability, and maintainability data, including vendor-generated data, from technical, developmental, and operational tests, should be properly archived and used in the final preproduction assessment of a prospective system. After procurement, field performance data and associated records should be retained for the system's life and used to provide continuing assessment of its reliability, availability, and maintainability characteristics.

Reliability, availability, and maintainability "certification" can be better accomplished through combined use of data collected during training exercises, developmental testing, component testing, bench testing, and operational testing, along with historical data on systems with similar suitability characteristics. Achieving this goal, however, will require commitment at the policy-making level, technical skill in the application of advanced statistical methods, and accessible data and documentation of acceptable quality.

SPECIAL STATISTICAL METHODS FOR RELIABILITY TEST AND EVALUATION

Two statistical procedures used in conjunction with reliability estimation of defense systems merit individual discussion: reliability growth modeling and accelerated life testing.

Reliability Growth Modeling

In many defense acquisition programs, as in private industry, the reliability of system hardware grows during design, development, testing, and field use. Reliability growth results from continued engineering efforts to improve the design, manufacture, and operation of repairable system hardware. Formal statistical models and associated methods of analysis have been developed to represent such growth and to estimate future reliability and related quantities of interest. A reliability growth analysis typically involves fitting, to observed failure data, an equation expressing the underlying hazard rate as a (usually decreasing) function of time. For example, one popular model, developed at the Army Materiel Systems Analysis Activity (AMSAA) in the 1970s, assumes that failures occur according to a non-homogeneous Poisson process (U.S. Department of Defense, 1981).
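A minimal sketch of such a fit follows. It assumes the power-law intensity commonly associated with the AMSAA (Crow) model, rate(t) = lam * beta * t**(beta - 1), and a test that is stopped at a fixed time. The text above does not spell out the functional form, so that choice, the function name, and the failure times are all assumptions made here for illustration.

```python
import math

def fit_power_law_nhpp(failure_times, test_end):
    """Maximum likelihood fit of a power-law NHPP to a time-truncated test.

    Intensity: rate(t) = lam * beta * t**(beta - 1); beta < 1 indicates growth.
    Returns (beta, lam, instantaneous MTBF at test_end).
    """
    n = len(failure_times)
    beta = n / sum(math.log(test_end / t) for t in failure_times)
    lam = n / test_end ** beta
    intensity_at_end = lam * beta * test_end ** (beta - 1)
    return beta, lam, 1.0 / intensity_at_end

# Hypothetical cumulative failure times (hours) from a 1,000-hour test.
times = [12.0, 40.0, 95.0, 180.0, 310.0, 520.0, 790.0]
beta, lam, mtbf_now = fit_power_law_nhpp(times, test_end=1000.0)
print(f"beta = {beta:.3f}, lam = {lam:.4f}, instantaneous MTBF = {mtbf_now:.1f} h")
print(f"cumulative MTBF = {1000.0 / len(times):.1f} h")
```

When the fitted beta is below 1, the instantaneous mean time between failures at the end of the test exceeds the cumulative value, which is the quantity a growth projection extrapolates forward. Any such projection rests on the assumed intensity form.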
Model-based predictions of future reliability are potentially useful in at least three stages of the acquisition process: in early development, to estimate the level of engineering effort (i.e., hours of testing and corrective action) that will be required to achieve prespecified standards of reliability; in operational test planning, to estimate reliability at the time of a future operational test event, for purposes of determining the system's readiness for testing or the duration and conditions of testing; and

in the assessment of operational suitability, to incorporate projected post-test engineering efforts in estimating the field reliability of a prospective system. Some organizations involved in test and evaluation, particularly AMSAA, are regularly engaged in reliability growth modeling for purposes of planning and analysis. One recent example involved the family of medium tactical vehicles, which underwent an initial phase of operational testing from September through December 1993. This phase of testing was terminated because the system failed to meet reliability requirements. During the subsequent engineering effort, reliability growth models were fit to observed failure data. These models were used for each variant (e.g., cargo truck, van) in the vehicle family to predict reliability at the time of the third operational test phase and to estimate the probability of success in the test. The family of medium tactical vehicles was ultimately successful in the third phase, conducted from April to July 1995 (U.S. Department of Defense, 1995a). At this time the observed reliability of each variant during testing significantly exceeded its model-based prediction. Results like this raise concerns about the validity of reliability growth modeling. Reliability growth modeling has also been used to assess suitability in support of production decisions. Such cases typically involve complex, single-shot equipment with high unit costs (e.g., missiles) and highly reliable electronic subsystems that would require operational tests with a fixed configuration to be carried out over a period of several thousand hours (Rolph and Steffey, 1994). Some amount of reliability testing is done to identify deficiencies and verify corrective actions. 
Programmatic considerations, particularly unit cost and schedule, can then lead to circumstances in which the decision to commit production funds involves a criterion such as achieving an instantaneous reliability value at a prescribed point on the system's projected growth curve, a decision point that may occur before the completion of operational testing.

Statistical modeling of reliability growth can be a valuable tool in the system development and testing process. However, the risk of significant discrepancies between predicted and observed reliability, as in the example of the family of medium tactical vehicles, underscores the need for thorough validation of reliability growth models. Issues related to the scoring of RAM data, discussed in a previous section, take on added importance in this context.

Recommendation 7.9: Any use of model-based reliability predictions in the assessment of operational suitability should be validated a posteriori with test and field experience. Persistent failure to achieve validation should contraindicate the use of reliability growth models for such purposes.

Accelerated Life Testing

A potentially serious deficiency in operational testing is the frequent inability to identify suitability problems relating to cumulative effects (e.g., aging and corrosion). One recent study cited several examples of such problems in new military systems, noting that in each case (Bridgman and Glass, 1992:6): "The OT [operational testing] did not last long enough for these problems to show up during the test, and no attempt was made in the evaluation of test results to use DT [developmental testing] results or other information to predict what suitability problems might appear after extended use in the field."

Assessing the distribution of time to failure of systems (or components) with very high reliability or with susceptibility to cumulative effects is a difficult statistical problem because of the implied need for very large exposure times. For this reason, a variety of accelerated testing methods have been developed that allow data collected over relatively short experimentation times to be used to make inferences about system reliability over time ranges considerably longer than the experimentation times.

The term "acceleration" has many different interpretations in the testing community, but it usually refers to one of three ways of making "time" (or any other scale used to measure system life) go "faster" in a test by increasing certain factors that directly affect reliability:

the use rate of the system;
the rate at which age affects the system; and
the degree of "stress" to which the system is subjected.

In addition, there are generally two different types of information a tester can gather:

system (or component) failure times; and
amount of degradation (physical, structural, etc.) in a system (or component).

Accelerated testing methods are potentially applicable in developmental, as well as operational, testing of prospective military systems.
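To illustrate how a stress-based acceleration relates test time to field time, the sketch below uses the Arrhenius temperature model, a widely used physical acceleration relationship. The text above does not single out any particular model, so the choice of Arrhenius, the function name, the temperatures, and the activation energies are all assumptions made here for illustration.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5   # Boltzmann constant in eV per kelvin

def arrhenius_acceleration(activation_energy_ev, use_temp_c, stress_temp_c):
    """Acceleration factor of a stress temperature relative to a use temperature."""
    use_k = use_temp_c + 273.15
    stress_k = stress_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / use_k - 1.0 / stress_k)
    return math.exp(exponent)

# Hypothetical test: 1,000 hours at 125 C, with intended use at 40 C.
test_hours = 1000.0
for ea in (0.5, 0.7, 0.9):            # assumed activation energies in eV
    af = arrhenius_acceleration(ea, 40.0, 125.0)
    print(f"Ea = {ea:.1f} eV: factor = {af:8.1f}, "
          f"equivalent use hours = {af * test_hours:,.0f}")
```

The equivalent field exposure claimed for the same test changes by large multiples across plausible activation energies, so the extrapolation depends heavily on a model input that the accelerated test itself does not supply.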
Lall, Pecht, and Cushing (1994) discuss the need for accelerated testing in the design and manufacture of electronic products, and they identify three difficult issues that have limited the application and acceptance of such methods:

determination of the dominant failure mechanisms, sites, and modes under intended life-cycle loads;
determination of appropriate stress levels for accelerated tests; and
assessment of product reliability under intended life-cycle loads from data obtained under accelerated tests.

A critical underlying requirement for the acceptability of using any accelerated testing method is the need for a model that relates the acceleration factors (usage rate, temperature, physical stresses, vibration, voltage, pressure, etc.) to failure events. Such models tend to fall into two general types:

physical: those based upon "first principles" of chemistry, physics, etc., and the various causative theories pertinent to the system; and
empirical: those based upon "fits" to experimental data obtained independent of the particular tests being performed.

Given a model, statistical approaches to extrapolating reliability statements from accelerated tests are straightforward (if somewhat complicated). However, the influence of the model on the final reliability statements is generally greater than that of the test data themselves. The wide variety of acceleration models available (see, e.g., Meeker and Escobar, 1993; Nelson, 1990), as well as the need for understanding their acceptability in any particular situation, creates an environment that is subject to misuse and exploitation.

Recommendation 7.10: Given the potential benefits of accelerated reliability testing methods, we support their further examination and use. To avoid misapplication, any model that is used as an explicit or implicit component of an accelerated reliability test must be subject to the same standards of validation, verification, and certification as models that are used for evaluation of system effectiveness.

BEST RAM PRACTICES, TRAINING, AND HANDBOOKS

An accepted set of "best RAM practices" is a goal much closer to being realized in industrial than in military settings. Models for components of these developments include the International Organization for Standardization (ISO) 9000 series and existing documents on practices in the automobile and telephone industries. The RAM-related documents in the ISO 9000 series address many subjects, ranging from the use of statistical principles and methods to data archiving and information feedback.
ISO 9000-4 concerns dependability8 program management and provides, for example, the following guidance for product suppliers (numbers in parentheses refer to document sections):

prepare specifications which contain qualitative and quantitative requirements for RAM performance and clearly state maintenance support assumptions (§ 6.3);
establish and maintain procedures for effective and adequate verification and validation of dependability requirements (§ 6.7);
establish and maintain access to effective statistical and other relevant qualitative and quantitative methods and models appropriate for prediction, analysis, and estimation of dependability characteristics (§ 5.2);
establish and maintain procedures for assessing life-cycle cost elements (§ 6.8);
establish and maintain data banks to provide feedback on the dependability of products from testing and field operation in order to assist in product design, current product improvement, and maintenance support planning (§ 5.3);
establish and maintain procedures for handling, storage, and analysis of failure and fault data from testing, manufacturing, and field operation (§ 6.11); and
retain for an appropriate period, defined in relation to the expected product lifetime, all documents containing dependability requirements, analyses and predictions, test instructions and results, and field data analysis records (§ 5.4).

Efforts to achieve more efficient (i.e., less expensive) decision making by pooling suitability data from various sources require documentation of the data sources and of the conditions under which the data were collected, as well as clear and consistent definitions of all terms used. Such efforts underscore the potential value of standardizing RAM testing and evaluation across the services and encouraging the use of best current practices.

DoD might draw constructively on industrial practices, particularly in such areas as documentation, uniform standards, and the pooling of information on operational suitability. As evidenced in the ISO 9000 series, documentation of processes and retention of RAM-related records (for important decisions and valuable data) are practices now greatly emphasized in industry. The same should be true for DoD, especially for the purposes of assessing operational suitability in support of major production decisions.

8

Dependability is a collective term used to describe availability performance and the performance of its influencing factors: reliability, maintainability, and maintenance support.
As we stress throughout this report, effective retention of information allows one to learn from historical data and past practices in a more systematic manner than is currently the case in operational testing.

Recommendation 7.11: The Department of Defense should move aggressively to adapt for all test agencies the International Organization for Standardization (ISO) standards relating to reliability, availability, and maintainability. Great attention should be given to having all test agencies ISO-certified in their respective areas of responsibility for assuring the suitability of prospective military systems.

Considerable differences with respect to RAM policy, practice, organization, and methodology exist among the services, as well as within the testing community in each service. These differences may be partly attributable to variability in the training and expertise of developmental and operational testing personnel, which in turn contributes to an uncritical reliance on certain modeling assumptions (e.g., exponentiality) in circumstances in which they may not be tenable. The manuals, handbooks, and reference materials presently serving as the basis for military life-testing applications should be upgraded, the statistical training of the personnel who carry out the military's operational RAM testing analysis should be comparably enhanced, and the consideration of alternative models and methods for RAM testing should become routine in operational testing across the services.

Recommendation 7.12: Military reliability, availability, and maintainability testing should be informed and guided by a new battery of military handbooks containing a modern treatment of all pertinent topics in the fields of reliability and life testing, including, but not limited to, the design and analysis of standard and accelerated tests, the handling of censored data, stress testing, and the modeling of and testing for reliability growth. The modeling perspective of these handbooks should be broad and include practical advice on model selection and model validation. The treatment should include discussion of a broad array of parametric models and should also describe non-parametric approaches.