Statistical Methods for Testing and Evaluating Defense Systems: Interim Report

2 Use of Experimental Design in Operational Testing

The goal of an operational test is to measure the performance, under various conditions, of diverse aspects of a newly developed defense system to determine whether the system satisfies certain criteria or specifications identified in the Test and Evaluation Master Plan, the Cost and Operational Effectiveness Analysis, the Operational Requirements Document, and related documents. There are typically a large number of relevant conditions of interest for every system, including terrain, weather conditions, types of attack, and use of obscurants and other countermeasures or evasionary tactics. Some of these conditions can be completely controlled; some can be controlled at times, but are often constrained by factors such as budget and schedule; and some are not subject to control. DoD assigns weights to combinations of these conditions in the Operational Mode Summary and Mission Profiles, with greater weight given to those combinations perceived as more likely to occur (some combinations might even be impossible) and those of greater military importance.

As noted in Chapter 1, the typical sample size available for test is small, because the size, scope, and duration of operational testing are dominated by budgetary and scheduling considerations. Therefore, it is beneficial to design the test as efficiently as possible so it can produce useful information about the performance of the system, both in the test scenarios themselves and, through some modeling, in untested combinations of the types of conditions cited above. Experimental design provides the theory and methods to be applied in selecting the test design, which includes the choice of test scenarios and their time sequence; determination of the sample size; and the use of techniques such as randomization, controls, matching or blocking, and sequential testing. (For an historical discussion of these techniques, see Appendix B.) Other decisions (such as whether force-on-force testing is needed, whether the test is one-on-one or many-on-many or part of a still larger engagement, and whether to test at the system level or test individual components separately) are integral to the test design as well, and have important statistical consequences.

A special aspect of DoD operational testing is the expense of the systems and the individual test articles involved. Even modest gains in operational testing efficiency can result in enormous savings either by correctly identifying when there is a need for cancellation, passing, or redesign of a system, or by reducing the need for test articles that can cost millions of dollars each.
As in industrial applications of experimental design for product testing, various circumstances can conspire to cause some runs to abort or cause the operational test designer to choose factors that were not planned in advance, thus compromising features such as the orthogonality of a design, causing confounding, and reducing the efficiency of the test. General discussion of these issues may be found in Hahn (1984). The panel is interested in understanding these real-life complications in DoD operational testing. As an example, the expense involved prohibits the use of more than one battleship in operational testing of naval systems, so it is impossible to measure the effects of differences in crew training on ship-to-ship variation.

A further complication is that there are typically a large number of system requirements or specifications, referred to as measures of performance or effectiveness, for DoD systems. It is not uncommon to have as many as several dozen important specifications, and thus it is unclear which single specification should be used in optimizing the design. One could choose a “most important” specification, such as the hit rate, and select a design that would be efficient in measuring that output. However, the result might be inefficiency in measuring system reliability, since reliability measurement typically involves a broader distribution of system ages and a larger test sample size than a test of hit rate. Various statistical methods can be applied in an attempt to compromise across outputs. The panel does not address this problem in this interim report (though it is discussed briefly in Appendix B), but we intend to examine it for our final report.

This chapter first presents the progress made by the panel in understanding how experimental design is currently used in DoD operational testing. This is followed by discussion of some concerns about certain aspects of the use of experimental design that we have investigated, although only cursorily. We next describe a novel approach proposed in an operational test conducted by the Army Operational Test and Evaluation Command (OPTEC) for the Army Tactical Missile System/Brilliant Anti-Tank system (a missile system that targets moving vehicles, especially tanks), which involves the use of experimental design in designing a small number of field tests to be used in calibrating a simulation model for the system. The panel is interested in understanding this approach, which may become much more common in the future as a method for combining simulations and field tests. The final section of the chapter describes the future work we plan to undertake before issuing our final report. As background material for the discussion in this chapter, a short history of experimental design is provided in Appendix B.

Some elements of our discussion (for example, testing at the center versus the edge of the operating envelope) focus on the question of how to define the design space (and associated factor levels). Other elements (for example, possible uses of fractional factorial designs) are concerned with approaches for efficiently testing a well-defined design space. Such considerations are thus complementary and are among the diverse issues that must be addressed in designing a sound operational test that yields accurate and valuable information about system performance.

CASE STUDY #1: APACHE LONGBOW HELICOPTER

To become better acquainted with the environment, characteristics, and constraints of DoD operational testing, the panel began by investigating a single system and following it from the test design, through the testing process, and finally to evaluation of the test results. After considering the current systems under test by OPTEC, the panel decided to examine the operational testing of the Apache Longbow helicopter.[1]

[1] The panel is extremely grateful for the cooperation of OPTEC in this activity, especially Henry Dubin, Harold Pasini, Carl Russell, and Michael Hall.
An initial meeting at OPTEC acquainted the panel with the overall test strategy: an operational test gunnery phase at China Lake, California, with eight gunnery events, followed by a force-on-force phase at Fort Hunter Liggett, composed of 30 trials (15 involving a previous version of the helicopter for purposes of control and 15 using the newer version). The gunnery phase was intended to measure the lethality of the Longbow, while the force-on-force test would measure survivability and suitability. There are 46 measures of performance associated with this system.

The panel focused its attention on the force-on-force trial, visiting the test site shortly before the operational test occurred to examine the test facility; understand the methods for registering the simulated effects of live fire; and gain a better understanding of constraints involving such factors as the facility, the soldiers, and the maintenance of the test units. Demonstrations provided at this time included the simulation of various enemy systems; scoring methods; and data collection, interpretation, and analysis that would be used to help understand why targets were missed during the test by studying whether there might have been some anomaly in the test conduct. The panel was also interested in assessment of reliability for the system.

The force-on-force operational test involved two major factors under the control of the experimenter: (1) mission type: movement to contact (deep), deliberate attack (deep), deliberate attack (close), or hasty attack (close); and (2) day or night. Either zero, two, or three replications were selected for combinations of mission type with day/night; zero replications was chosen for impossible combinations (for example, deep attacks are not carried out during daylight because of the formidable challenges to achieving significant penetration of enemy lines during periods of high visibility).

The panel continued its examination of the Apache Longbow's operational testing through a presentation by OPTEC staff outlining the evaluation of the test results after the testing had been completed. This presentation emphasized the collection, management, and analysis of data resulting from the operational test, including how these activities were related to various measures of performance, the treatment of outliers, and an outline of the decision process with respect to the final hurdle of milestone III. However, this briefing did not address some details in which we were interested concerning the summarization of information related to the 46 measures of performance and how the Defense Acquisition Board will use this information in deciding whether to proceed to full production for this system. Therefore, we intend to examine the evaluation process in more depth in completing our study of this operational test.

In examining the Apache Longbow operational test, the panel inspected various documents related to the test, including the Test and Evaluation Master Plan and the Operational Mode Summary and Mission Profiles. In addition, we examined some more general literature on the use of experimental design in operational testing, especially Fries (1994).

Completion of this case study will only modestly acquaint the panel with DoD operational testing practice, since it will represent experience with only one system for one service. Before issuing our final report, we will determine the extent to which the Apache Longbow experience is typical of Army operational testing practice.
Some Issues and Concerns

The panel is impressed to see that basic principles of experimental design are often used in operational tests. There seems to be much to commend about current practice. We believe the use of a control in the Apache Longbow operational test, which we consider important, is typical of Army tests. The quasi-factorial design of the Apache Longbow test also appears to be typical and indicates that the Army understands the efficiency gained from simultaneous change of test factors. At the same time, our visit to Fort Hunter Liggett made it clear that routine application of standard experimental design practice will not always be feasible given the constraints of available troops, scheduling of tests, small sample sizes, and the weighting of test environments.

Choice of Test Scenarios

Various considerations combine in determining the test scenarios used in operational testing. Certainly the perceived likelihood or importance of an application of the new system should be considered when selecting scenarios for test. Also, if the system is to be used in a variety of situations, it is important to measure its performance in each situation relative to that of a control, subject, of course, to limitations of the available test resources. This was done in the Apache Longbow testing we observed.

However, the panel did not see enough evidence that another important factor was considered in selecting the test scenarios: the a priori assessment of which scenarios would be discriminating in identifying situations where the new system might dominate the control, or vice versa. For instance, if 80 percent of the applications of a new system are likely to be at night, but the new system has no perceived advantage over the control at night, there is less reason to devote a large fraction of the test situations to night testing. Of course, the a priori wisdom may be wrong, but even if it is only approximately correct, substantial inefficiencies can result from allocating too much of the test to situations where both systems are likely to perform similarly.

Uncertain Scoring Rules, Measurement Inefficiencies

The panel has noted two concerns with respect to scoring and measurement. There is some evidence that the scoring rules for such events as missile hits on a target are vaguely defined, specifically with respect to which events are considered unusable or precisely how a trial is defined. We present three examples of this.
First, although reliability, availability, and maintainability failures are usually defined for a given system, there is often much disagreement within the scoring conference on whether a given system failure should be counted.[2] The question of what constitutes an outlier in an operational test, while difficult to answer in advance, should be addressed as objectively as possible. Second, when a force-on-force trial is aborted while in progress for a reason such as accident, weather, or instrumentation failure, the issue arises of how the data generated before the failure should be used. Third, the problem persists of how to handle data contaminated by failure of instrumentation or test personnel.

[2] A scoring conference is a group of about six to eight people, including testers, evaluators, and representatives of the program manager, who review information about possible test failures and determine whether to count such events as failures or exclude them from the regular part of the test evaluation.

It is also important to be more precise about the objective of each operational test. Sometimes what is desired is to understand the performance of the system in the most demanding of several environments, so the objective is to estimate a lower bound on system performance; at other times what is needed is a measurement of the average performance of the system across environments. Sometimes what is desired is to find system defects; at other times the objective is to characterize the operational profile. Ultimately, optimal test design (maximizing information for fixed test cost) depends on what the goal of testing is. For example, if one is testing to find faults, it is better to use experimental designs that “rig” the test in informative ways. As a further complication, operational tests can be used to calibrate and evaluate simulation models, and the design implications may be different for those objectives.

A related point is that averaging across different environments could mask important evidence of differentials in system effectiveness. For example, an average 60 percent hit rate could result either from a 60 percent hit rate in all environments or from a 100 percent hit rate in 60 percent of the environments and a 0 percent hit rate in 40 percent of the environments; clearly these are two very different types of performance. Thus, expressing results in terms of an average success rate is not wholly satisfactory.

A final point with respect to measurement is that much of the data that measure system effectiveness, especially with respect to hit rates, are zero-one data. It is well known that zero-one data are much less informative than data that indicate by how much a target was missed. This information can be used for better modeling the variability of a shot about its mean, which in turn can be used for better estimating the hit rate. The panel has been informed that this issue is being examined by testers at Fort Hunter Liggett.

Testing Only “Inside the Envelope”

Experimental design theory and practice suggest that in testing a system to determine its performance for many combinations of environmental factors, a substantial number of tests should be conducted for fairly extreme inputs, as well as those occurring more commonly. Modeling can then be used to estimate the performance for various intermediate environments. Operational testing, however, tends to focus on the environments most likely to occur in the field. While this approach has the advantage that one need not use a model to estimate the system performance for the more common environments, the disadvantage is that little is known about the performance when the system is strongly stressed. Certainly, this issue is explored in developmental testing.
However, the failure modes and failure rates of operational testing tend to be different from those of developmental testing.[3]

[3] One possible reason that operational testing usually does not involve more extreme scenarios is the risk that performance results in these scenarios will be misinterpreted as indicative of the system's typical performance, without proper consideration of the relative likelihood of the scenarios.

CASE STUDY #2: THE ATACMS/BAT SYSTEM

In the Army Tactical Missile System/Brilliant Anti-Tank (ATACMS/BAT) system, OPTEC proposes to use a relatively novel test design in which a simulation model, when calibrated by a small number of operational field tests, will provide an overall assessment of the effectiveness of the system under test. BAT submunitions use acoustic sensors to guide themselves toward moving vehicles and an infrared seeker to home terminally on targets. The submunitions are delivered to the approximate target area by the ATACMS, which releases the submunitions from its main missile. The ATACMS/BAT system is very expensive, costing several million dollars per missile. The experimental design issue is how to choose the small number of operational field tests such that the calibrated simulation will be as informative as possible. The panel is examining this problem. Here we provide a description of the ATACMS/BAT operational testing program to illustrate one context in which to think about alternative approaches to operational test design.
Plans for Operational Testing

The ATACMS/BAT operational test will not be a traditional Army operational test involving force-on-force trials, but rather will be similar to a demonstration test. The Army would use a model to simulate what might happen in a real engagement. To calibrate the simulation, various kinds of data will be collected, for example, from individual submunition flights and other types of trials, including the operational test trials. A relatively novel aspect of this operational test is the use of a limited number of operational test events along with a simulation to evaluate a new weapons system. This approach has been taken because budgetary limitations on the sample size and the limited availability of equipment such as radio-controlled tanks for testing make it infeasible to develop a program of field testing that could answer the key questions about the performance of this system in a real operational testing environment.

Operational testing of the ATACMS/BAT system is scheduled to take place in 1998. According to the Test and Evaluation Master Plan:

    This portion of the system evaluation includes the Army ATACMS/BAT launch, missile flight, dispense of the BAT submunitions, the transition to independent flight, acoustic and infrared homing, and final impact on targets. Evaluation of this discrete event also includes assessment of support system/subsystem RAM [reliability, availability, and maintainability] requirements, software, terminal accuracy, dispense effectiveness, kills per launcher load, and BAT effectiveness in the presence of countermeasures. Initial operational test and evaluation is the primary source of data for assessing these system capabilities.

There is no baseline system for comparison. The number of armored vehicle kills (against a battalion of tanks) is the bottom-line measure of the system's success. Tank battalions vary in size, but typically involve about 150 vehicles moving in formation. (Unfortunately, every country moves its tanks somewhat differently.) Under the test scoring rules, no credit is given if the submunition hits the tank treads or a truck or if two submunitions hit the same tank.

There is one operational test site, and the Army has spent several million dollars developing it. There will be materiel constraints on the operational test. Only eight missiles, each of which has a full set of 13 BAT submunitions, are available for testing. Also, the test battalion will involve only 21 remotely controlled vehicles. Thus, the Army plans to use simulation as an extrapolation device, particularly in generalizing from 21 tanks to a full-size battalion (approximately 150 tanks).

Important Test Factors and Conditions

As stated above, all stages of missile operation must be considered in the operational test, particularly acoustic detection, infrared detection, and target impact. Factors that may affect acoustic detection of vehicles include distance from target (location, delivery error), weather (wind, air density, rain), vehicle signature (type, speed, formation), and terrain. For example, the submunitions are not independently targeted; they are programmed with logic to go to different targets. Their success at picking different targets can be affected by such factors as wind, rain, temperature, and cloud layers. Obviously, one cannot learn about system performance during bad weather if testing is conducted only on dry days. However, it is difficult to conduct operational tests in rain because the test instrumentation does not function well, and much data can be lost.

Such factors as weather (rain, snow) and environment (dust, smoke) can also affect infrared detection of vehicles. Factors affecting the conditional probability of a vehicle kill given a hit include the hardness of the vehicle and the location of the hit. Possible countermeasures must also be considered. For example, the tanks may disperse at some point, instead of advancing in a straight-line formation, or may try to employ decoys or smoke obscuration.
The operational test design, or shot matrix, in the Test and Evaluation Master Plan (see Table 2-1) lists eight test events that vary according to such factors as range of engagement, target location error, logic of targeting software, type of tank formation, aimpoint, time of day, tank speed and spacing, and threat environment. Three levels are specified for the range of engagement: near minimum, near maximum, and a medium range specified as either “2/3s” or “ROC,” the required operational capability. Target location error has two levels: median and one standard deviation (“1 sigma”) above the median level. (The distinction between centralized and decentralized is unimportant in this context.) The logic of targeting software (primary vs. alternate) and type of tank formation (linear vs. dispersed) are combined into a single, two-level factor. Aimpoint distance is either short or long, and aimpoint direction is either left or right. The aimpoint factors are not expected to be important in affecting system performance. (The payload inventory is also unimportant in this context.) Tanks are expected to travel at lower speeds and in denser formations during night operations; therefore, tank speed and spacing are combined with time of day into a single two-level factor (day vs. night). Three different threat environments are possible: benign, Level 1, and Level 2 (most severe). Clearly, in view of the limited sample size, many potentially influential factors are not represented in the shot matrix.

Possible Statistical Methods and Aspects for Further Consideration

One approach to operational testing for the ATACMS/BAT system would be to design a large fractional factorial experiment for those factors thought to have the greatest influence on the system performance. The number of effective replications can be increased if the assumption that all of the included design factors are influential turns out to be incorrect. Assuming that the aimpoint factors are inactive, a complete factorial experiment for the ATACMS/BAT system would require 2^3 × 3^2 = 72 design points (three two-level factors: target location error, targeting logic/formation, and day/night; and two three-level factors: range and threat environment). However, fractional factorial designs with two- and three-level factors could provide much information while using substantially fewer replications than a complete factorial design. Of course, these designs are less useful when higher-order interactions among factors are significant. (For a further discussion of factorial designs, see Appendix B, as well as Box and Hunter, 1961.)

Another complication is that environment (or scenario) is a factor with more than two settings (levels). In the extreme, the ATACMS/BAT operational test results might be regarded as samples from several different populations representing test results from each environment. Since it will not be possible to evaluate the test in several unrelated settings, some consolidation of scenarios is needed. It is necessary to understand how to consolidate scenarios by identifying the underlying physical characteristics that have an impact on the performance measures, and to relate the performance of the system, possibly through use of a parametric model, to the underlying characteristics of those environments. This is essentially the issue discussed in Appendix C.

While the above fractional factorial approach has advantages with respect to understanding system performance equally in each scenario, we can see some benefits of the current OPTEC approach if we assume that the majority of interest is focused on the “central” scenario, or the scenario of most interest. In the current OPTEC approach, the largest number of test units are allocated to this scenario, while the others are used to study one-factor-at-a-time perturbations around this scenario, such as going from day to night or from linear to dispersed formation. This approach could be well suited to gathering information on such issues while not losing too much efficiency at the scenario of most interest. And if it turns out that changing one or more factors has no effect, the information from these settings can be pooled to gain further efficiency at the scenario of most interest.
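
To make the size of the design space concrete, the following is a minimal sketch (not part of the panel's or OPTEC's analysis) that enumerates the complete factorial over the five factors described above and, for contrast, a one-factor-at-a-time plan around a single central scenario. The dictionary keys, level labels, and the choice of central scenario are illustrative assumptions; the text does not state which scenario OPTEC treats as central.

```python
from itertools import product

# Factor levels paraphrased from the shot-matrix discussion. The names are
# hypothetical labels chosen for this sketch, not official terminology.
FACTORS = {
    "range": ["near_min", "medium", "near_max"],
    "target_location_error": ["median", "one_sigma"],
    "logic_formation": ["primary_linear", "alternate_dispersed"],
    "time_of_day": ["day", "night"],
    "threat": ["benign", "level_1", "level_2"],
}

# ASSUMPTION: an illustrative central scenario, picked only for this example.
CENTER = {
    "range": "medium",
    "target_location_error": "median",
    "logic_formation": "primary_linear",
    "time_of_day": "night",
    "threat": "level_1",
}


def full_factorial(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


def one_factor_at_a_time(factors, center):
    """Return the center point plus each point differing from it in one factor."""
    points = [dict(center)]
    for name, levels in factors.items():
        for level in levels:
            if level != center[name]:
                points.append({**center, name: level})
    return points


design = full_factorial(FACTORS)
ofat = one_factor_at_a_time(FACTORS, CENTER)
print(len(design), "points in the complete factorial")
print(len(ofat), "points in the one-factor-at-a-time plan")
```

Under these assumptions the complete factorial has 3 × 2 × 2 × 2 × 3 = 72 points, while the one-factor-at-a-time plan has only 8. The smaller plan cannot estimate interactions among factors, which is the trade-off discussed above.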
TABLE 2-1 Army Tactical Missile System Block II/Brilliant Anti-Tank Operational Test Shot Matrix

Test (a) | Range | TLE (Method of Control) | Target/Logic | Aimpoint | Payload | Environment (b)
---------+----------+-------------------------+---------------------+----------------------+--------------+-------------------------
DT/OT 1 | ROC | ROC | Primary/Linear | Long/Right | 9 Tt, 4 Tsw | Benign ROC Level 1 night
DT/OT 2 | 2/3s | Median centralized | Primary/Linear | Short/Right | 13 Tt | Level 1 night
OT 1 | Near max | 1 sigma centralized | Primary/Linear | Short/Left | 13 Tt | Levels 1 & 2 day
OT 2 | Near min | Median decentralized | Alternate/Dispersed | Long/Left | 7 Tt, 6 Tsw | Level 1 day
OT 3 (c) | 2/3s | Median decentralized AMC | Primary/Linear | C3 system determined | 10 Tt, 3 Tsw | Level 1 night
OT 4 (c) | 2/3s | Median centralized AMC | Primary/Linear | C3 systems determined | 10 Tt, 3 Tsw | Level 1 night
OT 5 (d) | 2/3s | 1 sigma decentralized | Primary/Linear | Long/On line | 13 Tt | Levels 1 & 2 night
OT 6 (d) | 2/3s | Median centralized | To be determined | To be determined | 13 Tt | Level 1 day

NOTE: AMC = Army Materiel Command; C3 = Command, Control and Communications; DT = Developmental Testing; OT = Operational Testing; ROC = Required Operational Capability.

(a) This shot matrix reflects a two-flight developmental testing/operational testing program of engineering and manufacturing development assets conducted by the Test and Evaluation Command, and a six-flight operational testing program of Low-Rate Initial Production brilliant anti-tank assets conducted by the Test and Experimentation Command. This shot matrix is subject to threat accreditation.
(b) Flights will be conducted during daylight hours for safety and data collection. Night/day environments pertain to vehicle speed and spacing.
(c) OT 3 and OT 4 are ripple firings.
(d) Flight objectives may be revised based on unforeseen data shortfalls.
FUTURE WORK

The panel does not yet understand the extent to which experimental design techniques, both routine and more sophisticated, have become a part of operational test design. Therefore, this will be a major emphasis of our further work. Relatively sophisticated techniques might be needed to overcome complications posed by various constraints discussed above. Bayesian experimental design might provide methods for incorporating information from developmental testing in the designs of operational tests. The notion of pilot testing has been discussed and will be further examined.

We will examine the use of experimental design in current operational test practice in the Air Force and the Navy. This effort will include studying relevant literature that provides examples of current practice and the directives presenting test policy in these services. It will include as well investigating the potential opportunity for and benefit from the use of more sophisticated experimental design techniques. Also, the panel will continue to follow progress in the design of the operational test for the ATACMS/BAT system. We will also devote some additional effort to understanding the use of experimental design in operational testing in industry and in nonmilitary federal agencies.

The panel is also interested in a problem suggested by Henry Dubin, technical director of OPTEC: how to allocate a small sample of test objects optimally to several environments so that the test sample is maximally informative about the overall performance of the system. Appendix C presents the panel's thoughts on this topic to date, but we anticipate revisiting the problem and refining our thinking.

The panel understands that both the degree of training received prior to an operational test and the learning curves of soldiers during the test are important confounding factors. While developmental testing generally makes use of subjects relatively well trained in the system under test, operational testing makes use of subjects whose training more closely approximates the training a user will receive in a real application. It is important to ensure that comparisons between a control and a new system are not confounded by, say, users' being more familiar with the control than the new system, or vice versa. The panel is examining the possibility of using trend-free and other experimental designs that address the possible confounding of learning effects (Daniel, 1976; Hill, 1960). For example, binary data on system kills are typically not binomial; instead they are dependent because of the learning effects during trials of operational tests. Player learning is generally not accounted for in current test practice. At best, there is side-by-side shooting in which, perhaps, learning occurs at the same rate during comparative testing of the baseline and prospective systems.

With respect to evaluation, a real problem is how to decide what sample size is adequate for making decisions with a stated confidence. The panel will examine this question (sometimes referred to as “How much testing is enough?”) for discussion in our final report. This effort may involve notions of Bayesian inference, decision theory, and graphical presentation of uncertainty. It is important to note in this context that experimental design principles can help make effective use of available test resources, but no design can provide definitive conclusions when insufficient data have been collected. The panel is interested in examining explicitly the tradeoff between cost and benefit of testing in our final report.

Furthermore, the panel believes that the hypothesis-testing framework of operational testing is not sensible. The object of operational testing should be to provide to the decision maker the data most valuable for deciding on the next course of action. The next course of action belongs in a continuum ranging from complete acceptance to complete rejection. Therefore, in operational testing one should concentrate on estimation procedures with statements of attendant risks. We also plan to explore the utility of other methods for combining information for purposes of evaluation, including hierarchical modeling.
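
As one illustration of what an estimate "with statements of attendant risks" might look like, the sketch below computes a Wilson score interval for a hit rate at several sample sizes. This is a standard textbook interval, not a method the panel prescribes. It assumes independent trials with a common hit probability, an assumption the discussion above warns is often violated because of learning effects, so real intervals would likely be wider. The function name, the 90 percent confidence level, and the sample sizes are illustrative choices; the 60 percent rate echoes the hit-rate example used earlier in the chapter.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, trials, confidence=0.90):
    """Wilson score interval for a binomial proportion.

    ASSUMPTION: trials are independent with a common success probability.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / trials
                    + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# Illustrative sample sizes; observed hits are about 60 percent of each.
for n in (8, 30, 100):
    hits = round(0.6 * n)
    low, high = wilson_interval(hits, n)
    print(f"n={n:3d} hits={hits:3d} 90% interval=({low:.2f}, {high:.2f})")
```

Running the loop shows how slowly the interval narrows as the sample grows, which is one way to frame the question of how much testing is enough for a given decision.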