Statistical Methods for Testing and Evaluating Defense Systems: Interim Report

5 Use of Modeling and Simulation in Operational Testing

The apparent success of many military simulations for the purposes of training, doctrine development, investigation of advanced system concepts, mission rehearsal, and assessments of threats and countermeasures has resulted in their increased use for these purposes.1 At the same time, the DoD testing community, expecting substantial reductions in its budget in the near future, has expanded the use of simulations to assist in operational testing, since they offer the potential of safe and inexpensive “testing” of hardware. Our specific charge in this area is to address how statistical methods might be appropriately used to assess and validate this potential.

It seems clear that few if any of the current collection of simulations were designed for use in developmental or operational testing. Constructive models, such as JANUS and CASTFOREM, have been used for comparative evaluations of specific capabilities among candidate systems or parameter values, but not to justify with any validity those systems or value comparisons within an actual combat setting.2 Therefore an important issue must be explored: the extent to which simulations, possibly with some adjustments and enhancements, could be used to assist in developmental or operational testing to save limited test funds, enhance safety, effectively increase the sample size of a test, or perhaps permit the extrapolation of test results to untested scenarios.

The goals of building simulations for training and doctrine are not necessarily compatible with the goals of building simulation models for assessing the operational effectiveness of a system. Some specific concerns are as follows:

- Can simulations built for other purposes be of use in either developmental or operational testing in their present state? What modifications might generally improve their utility for this purpose?
1 We use the term “simulation” to mean both modeling and simulation.
2 JANUS and CASTFOREM are multipurpose interactive war-game models used to examine tactics.
- How can their results, either in original form or suitably modified, be used to help plan developmental or operational tests?
- How can the results from simulations and either developmental or operational tests be combined to obtain a defensible assessment of effectiveness or suitability?

For example, simulation might be used to identify weak spots in a system before an operational test is conducted so one can make fixes, alleviate problems in the testing environment (e.g., adjusting for units that were not killed because of instrumentation difficulties), or identify scenarios to test by selecting those that cover a spectrum of differences in performance between a new system and the system it is replacing.

These questions are all related to the degree to which simulations can approximate laboratory or field tests, and their answers involve verification, validation, and accreditation. Thus we have also investigated the degree to which simulations have been (or are supposed to be) validated. Two issues relate specifically to the use of statistics in simulation. First is the treatment of stochastic elements in simulations, whether the simulations are used for their original purposes or for operational testing, and the associated parameter estimation and inference. Second is the use of simulations in the development of requirements documents, since this represents part of the acquisition process and might play a role in a more continuous evaluation of weapons systems.

The next section describes the panel's scope, procedures, and progress to date in the modeling and simulation area. This is followed by discussion of several concerns raised by our work thus far. The chapter ends with a summary and a review of the panel's planned future work in this area.
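The question of how simulation and field-test results might be combined can be made concrete with a small sketch. The code below pools two independent estimates of the same performance measure by inverse-variance weighting; all numbers are hypothetical, and this illustrates one standard statistical device, not a procedure drawn from or endorsed by the documents reviewed here.

```python
import math

def combine_estimates(est_a, var_a, est_b, var_b):
    """Combine two independent estimates of the same quantity by
    inverse-variance (precision) weighting: the pooled estimate gives
    more weight to the more precise source."""
    w_a, w_b = 1.0 / var_a, 1.0 / var_b
    combined = (w_a * est_a + w_b * est_b) / (w_a + w_b)
    combined_var = 1.0 / (w_a + w_b)  # always smaller than either input variance
    return combined, combined_var

# Hypothetical numbers: a hit-probability estimate from many cheap
# simulation runs, and one from a few expensive field trials.
sim_est, sim_var = 0.70, 0.0004
test_est, test_var = 0.62, 0.0036

pooled, pooled_var = combine_estimates(sim_est, sim_var, test_est, test_var)
print(round(pooled, 3), round(math.sqrt(pooled_var), 4))
```

The weighting presupposes that both sources estimate the same quantity without bias; a systematically biased simulation (the central concern of this chapter) would make the pooled estimate misleading, which is one reason validation must precede any such combination.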
SCOPE, PROCEDURES, AND PROGRESS TO DATE

To carry out our charge in this area, the panel has examined a number of documents that describe the current use of simulations both for their original purposes and for the purpose of operational testing. Some documents that have been particularly useful are Improving Test and Evaluation Effectiveness (Defense Science Board, 1989), a 1987 General Accounting Office report on DoD simulation (U.S. General Accounting Office, 1987), Systems Acquisition Manager's Guide for the Use of Models and Simulation (Defense Systems Management College, 1994), and A Framework for Using Advanced Distributed Simulation in Operational Test (Wiesenhahn and Dighton, 1993). The panel's working group on modeling and simulation also participated in a one-day meeting at the Institute for Defense Analyses, where briefings were given on various simulations and their utility for operational testing.

Our experience to date has focused on Army simulations (though the one-day meeting covered simulations for both the Navy and the Air Force). Before the panel's final report is issued, we intend to (1) convene one or two more meetings and examine more of the relevant literature; (2) investigate more and different types of simulations; (3) familiarize ourselves with more of the standards and reference documents from the services other than the Army; and (4) examine more of the relevant practices from industry, from defense acquisition in other countries, and from other federal organizations such as NASA.

The panel's investigation in this area is clearly related to topics being treated by some of our other working groups. Relevant discussion of some of the issues raised in this chapter may therefore be found in the chapters on experimental design, software reliability, and reliability, availability, and maintainability.
This is because (1) the planning of a simulation exercise involves considerations of experimental design; (2) simulations are usually software-intensive systems and therefore raise the issue
of software reliability; and (3) reliability, availability, and maintainability issues should necessarily be included in any simulation that attempts to measure suitability or operational readiness. Unfortunately, we have found that military system simulations often do not incorporate notions of reliability, availability, and maintainability and instead implicitly assume that the simulated system is fully operational throughout the exercise.

Even though much of the discussion below focuses on problems, we applaud efforts made throughout DoD to make use of simulation technology to save money and make more effective use of operational testing. Much of the DoD work on simulation is technologically exciting and impressive. The panel is cognizant that budgetary constraints are forcing much of this interest in simulation. However, there is a great deal of skepticism in the DoD community about the benefits of using simulations designed for training and other purposes for the new purpose of operational testing. Our goal is not to add to this pessimistic view, but to assist DoD in its attempt to use simulation intelligently and supportably for operational testing.

CONCERNS

Rigorous Validation of Simulations Is Infrequent

Two main activities make up simulation validation, broadly defined as follows: (1) external validation is the comparison of model output with “true” values, and (2) sensitivity analysis is the determination of how inputs affect various outputs (an “internal” assessment). External validation of simulations is difficult and expensive in most applications, and the defense testing application is no exception. “True” values are rarely available: engagements are, fortunately, relatively rare; they occur in specific environments; and when they do occur, data collection is quite difficult.
Furthermore, operational tests do not really provide true values, since they too are simulations to some extent: no live ammunition is used; the effects of weapons are simulated through the use of sensors; personnel and equipment are not subject to the stresses of actual battle; and maneuver areas are constrained and to some extent familiar to some testing units. While these arguments present genuine challenges to the process of external validation, they should not be taken to imply that such external validation is impossible.

The panel is aware of a few attempts to compare simulation results with the results of operational tests. We applaud these efforts. Any discrepancies between simulation outputs and operational test results (relative to estimates of the variance in simulation outputs) indicate cause for concern. Such discrepancies should raise questions about a simulation's utility for operational testing, and should trigger careful examination of both the model and the test scenario to identify the reason. It should be noted that a discrepancy may be due to infidelity of the test scenario to operational conditions, and not necessarily to any problem with the simulation.

There is an important difference (one we suspect is not always well understood by the test community in general) between comparing simulation outputs with test results and using test results to “tune” a simulation. Many complex simulations involve a large number of “free” parameters—those that can be set at different values by the analyst running the simulation. Some of these parameters are set on the basis of prior field test data from the subsystems in question. Others, however, may be adjusted specifically to improve the correspondence of simulation outputs with the particular operational test results with which they are being compared.
Particularly when the number of free parameters is large in relation to the amount of available operational test data, close correspondence between a “tuned” simulation and operational results does not necessarily imply that the simulation would be a good
predictor in any scenario differing from those used to tune it. A large literature is devoted to this phenomenon, known as “overfitting.”3

On the other hand, in some cases a simulation may be incapable of external validation in the strict sense of correspondence of its outputs with the real world, but may still be useful in operational testing. For example, suppose that it is generally agreed that a simulation is deficient in a certain respect, but the direction of the deficiency's impact on the outcome in question is known. This might occur, for example, under the following conditions:

- An otherwise accurate simulation fails to incorporate reliability, availability, and maintainability factors.
- System A is known to be more reliable than system B.
- It is agreed that increased reliability would improve the overall performance of system A relative to system B.

Then if system A performs better in a simulation than system B, it can be argued that the result would have been even more strongly in favor of system A had reliability, availability, and maintainability factors been incorporated (for other examples, see Hodges and Dewar, 1992).

Sensitivity analysis is more widespread and more sensibly applied in the validation of DoD simulations. However, the high dimensionality of the input space for many simulations necessitates more efficient sampling and evaluation tools for learning about the features of these complex simulations. To choose input samples more effectively, sampling techniques derived from fractional factorial designs (see Appendix B) or Latin hypercube sampling (see McKay et al., 1979) should therefore be used rather than one-at-a-time sensitivity analysis.
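As a concrete illustration of the stratified alternative to one-at-a-time variation, the following sketch draws a Latin hypercube sample on the unit cube. It is illustrative code only, not taken from any DoD simulation tool; input dimensions and sample counts are arbitrary.

```python
import random

def latin_hypercube(n_samples, n_dims, seed=1):
    """Draw a Latin hypercube sample on [0, 1]^n_dims: each dimension is
    divided into n_samples equal strata, and each stratum is sampled
    exactly once, in an independently shuffled order per dimension."""
    rng = random.Random(seed)
    samples = [[0.0] * n_dims for _ in range(n_samples)]
    for d in range(n_dims):
        strata = list(range(n_samples))
        rng.shuffle(strata)
        for i in range(n_samples):
            # one point placed uniformly within stratum strata[i]
            samples[i][d] = (strata[i] + rng.random()) / n_samples
    return samples

# 10 design points in a hypothetical 4-dimensional input space; every
# dimension gets exactly one point in each of its 10 strata, unlike
# one-at-a-time variation, which leaves most of the space unexplored.
design = latin_hypercube(10, 4)
for point in design[:3]:
    print([round(x, 2) for x in point])
```

In practice the unit-cube coordinates would be rescaled to the ranges of the actual simulation inputs before the runs are made.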
3 Overfitting is said to occur for a model-data set combination when a simple version of the model from a model hierarchy, formed by setting some parameters to fixed values (typically zero), is superior in predictive performance to a more complicated version of the model formed by estimating those parameters from the data set. Less abstractly, an objective definition of overfitting is possible through correspondence with a statistic that measures it, e.g., the Cp statistic for multiple regression models: multiple regression models with a high Cp statistic could be defined as overfit.

To help analyze the resulting paired input-output data set, various methods for fitting surfaces to collections of test points should be used, either parametrically with response surface modeling (see Appendix B) or nonparametrically (e.g., with multivariate adaptive regression splines; see Friedman, 1991).

At this point in the panel's investigation, we have yet to see a thorough validation of any simulation used for operational testing. Furthermore, we have yet to be provided evidence that modern statistical procedures are in regular use for simulation validation. (This is in contrast to the effort—and obvious successes—in the verification process.) On the other hand, there are well-validated simulations used for “providing lessons” on concepts. For example, EADSIM (a force-on-force air defense simulation) was apparently validated against field tests with favorable results.

While the DoD directives we have examined are complete in their discussion of the need for validation and documentation, we have had difficulty determining how well these directives have been followed in practice. Not much seems to have changed since the following was noted seven years ago (U.S. General Accounting Office, 1988:11, 46):

In general, the efforts to validate simulation results by direct comparison to data on weapons effectiveness derived by other means are weak, and it would require substantial work to increase their credibility. Credibility would have been helped by . . . establishing that the simulation results were statistically representative. Perhaps the strongest
contribution to credibility came from efforts to test the parameters of models and to run the models with alternative scenarios. Although many attempts have been made to develop procedures for assessing the credibility of a model/simulation, none have gained widespread acceptance. At the present time, there is no policy or process in place in DoD to assess the credibility of specific models and simulations to be used in the test and evaluation and the acquisition process.

We have found little evidence to show that this situation has changed substantially.

There is a large literature on the statistical validation of complicated computer models. While this is an active area of research, a consensus is developing on what to do. This literature needs to be more widely disseminated in the DoD modeling community; important references include McKay (1992), Hodges (1987), Hodges and Dewar (1992), Iman and Conover (1982), Mitchell and Wilson (1979), Citro and Hanushek (1991), Doctor (1989), and Sacks et al. (1989).

Little Evidence Is Seen for Use of Statistical Methods in Simulations

There are a number of ways statistical methods can be brought to bear in the design of simulation runs and the analysis of simulation outputs. The panel has seen little evidence of awareness of these approaches. Specific areas that might be incorporated routinely include the following:

- Understanding of the relationship between inputs and outputs. Methods from the statistical design of experiments can be applied to plan a set of simulation runs that provides the best information about how outputs vary as inputs change. Response surface methods and more sophisticated smoothing algorithms can then be applied to interpolate results to cases not run.
- Estimation of variances, and use of estimated variances in decision making. There are recently developed and easily applied methods for estimating variances of simulation outputs.
Analysis-of-variance methods can be used to draw inferences about whether observed differences are real or can be explained by natural variation. These methods can be applied in comparing results for different systems, or in validating simulations. For example, the Army Operational Test and Evaluation Command (OPTEC) performed a study comparing live, virtual (SIMNET), and constructive (CASTFOREM) simulations for the M1A2 (an advanced version of the tank used in Operation Desert Storm). The results demonstrated the limitations of simulation in approximating operational testing, a finding supported by the difference between the results from the live exercise and those from the virtual and constructive simulations. However, no confidence intervals were reported for the differences in output (which is not to say that developing such confidence intervals would not have required sophisticated techniques). There was therefore no formal basis for inferring whether the differences found were real or simply due to natural variation. It is likely that analysis-of-variance and related techniques could have been used to examine whether the differences between these simulations were due to natural variation or to systematic differences between the simulations.

- Detection and analysis of outliers. Outliers should be examined separately to determine the reasons for their unusual values. Outliers in operational test data may be due to coding errors or to unusual circumstances not representative of combat; if this is determined to be the case, the outlier in question should be handled separately from other data points. In simulation results, outliers are crucial for identifying regions of the input-output space in which the behavior of the simulation changes qualitatively.
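The kind of formal comparison described above can be sketched in a few lines. The code below computes a one-way analysis-of-variance F statistic for three groups of simulated outputs; the data are synthetic and purely illustrative, not the M1A2 study results, and "exchange ratio" is only a hypothetical outcome measure.

```python
import random
from statistics import mean

def one_way_anova_F(groups):
    """One-way analysis-of-variance F statistic: the ratio of the
    between-group mean square to the within-group mean square for
    k groups of observations."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical exchange-ratio outputs from live, virtual, and constructive
# trials of the same scenario (synthetic data).
rng = random.Random(7)
live = [rng.gauss(2.0, 0.5) for _ in range(8)]
virtual = [rng.gauss(2.6, 0.5) for _ in range(8)]
constructive = [rng.gauss(3.1, 0.5) for _ in range(8)]

F = one_way_anova_F([live, virtual, constructive])
print(round(F, 2))  # to be compared with the F(2, 21) reference distribution
```

A large F relative to the F(k-1, n-k) reference distribution indicates that the differences among the three modes exceed what natural variation would explain, which is exactly the inference the OPTEC study lacked a formal basis for.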
DoD Literature on Use of Simulations Often Lacks Statistical Content

As momentum builds for the use of simulation in operational testing, there is clearly concern in the DoD community that simulation be applied correctly and cost-effectively. Most of the literature and discussions on the use of simulations in operational testing encountered by the panel were thoughtful and constructive. (A particularly good discussion can be found in Wiesenhahn and Dighton, 1993.) However, the panel is concerned by the extremely limited discussion of statistics and the almost total lack of statistical perspective in the literature we encountered.

For example, there is serious concern in the community that simulations be appropriately validated. The literature repeatedly stresses that validation must be related to the purpose at hand (see, e.g., Hodges and Dewar, 1992). The panel agrees with this point, applauds the concern with proper validation, and found the discussions to be thorough and correct. However, there is little discussion of what it means to demonstrate that a simulation corresponds to the real world. Given the level of uncertainty in the results of operational testing scenarios and the stochastic nature of many simulations, it is clear that correspondence with the real world can be established only within bounds of natural variability. Validation of simulation is thus largely a statistical exercise. Yet there is almost no discussion of statistical issues in the various directives and recommendations we encountered on the use of simulations in operational testing. Where statistical issues are discussed, the treatment is generally not adequate. As noted above, any discussion of statistics is usually concerned only with the precision with which expectations can be estimated.
We found little evidence of concern for estimating variances or for proper consideration of variability in the use of results for decision making. Moreover, statistical procedures designed for fixed sample sizes have been applied inappropriately in sequential settings. (See, e.g., U.S. General Accounting Office, 1987, which uncritically discusses a procedure in which confidence intervals were constructed repeatedly using additional runs of a simulation, until the interval was sufficiently narrow that it was determined no further runs were needed. Such data-dependent stopping invalidates the nominal confidence level of the final interval.)

Distributed Interactive Simulation Raises Additional Concerns

Distributed interactive simulation is a relatively new technology. For systems in which command and control is an important determinant of operational effectiveness, traditional constructive simulations that do not incorporate command and control are of limited use in evaluating operational effectiveness. Distributed interactive simulation with a man in the loop may have the potential to incorporate command and control at less expense than a field test. Although actual military applications to date have been limited, it is widely believed that distributed interactive simulation can contribute to effective use of simulation in the operational testing process, and widely claimed that it can improve some aspects of realism in the operational test environment. For example, semiautomated forces can be used to simulate threat densities not possible in field tests.

Our concerns regarding statistical issues apply with even more gravity to distributed interactive simulation. Running a distributed interactive simulation is likely to be more time-consuming and expensive than running most conventional constructive simulations, which raises questions about the ability to obtain sample sizes sufficient to estimate results with reasonable precision.
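The sample-size question can be planned up front with a fixed-n calculation, which also avoids the inappropriate run-until-the-interval-is-narrow stopping rule criticized above. The sketch below uses normal-theory formulas with hypothetical numbers; the output standard deviation would in practice come from pilot runs.

```python
import math

def required_replications(sd, half_width, z=1.96):
    """Number of independent replications needed so a normal-theory 95%
    confidence interval for a mean simulation output has the given
    half-width: n = (z * sd / half_width)^2, rounded up."""
    return math.ceil((z * sd / half_width) ** 2)

# Hypothetical: output standard deviation 0.8 (estimated from pilot runs).
# Precision scales as 1/sqrt(n), so halving the desired interval width
# roughly quadruples the number of runs required.
for hw in (0.4, 0.2, 0.1):
    print(hw, required_replications(0.8, hw))
```

Because the run count grows with the square of the desired precision, expensive man-in-the-loop exercises may simply be unable to support tight interval estimates, which is the concern raised in the text.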
Moreover, there is a temptation to presume that the conditions built into a distributed interactive simulation are directly analogous to controllable or independent variables, and are thus subject to the same kinds of statistical treatment. In fact, most human-introduced aspects of a distributed interactive simulation are peculiar to the setting in which the simulation is run (reflecting such factors as fatigue,
morale, state of alertness or fear, and anticipation of the scoring rules to be used), and require extra-statistical analysis.

Simulations Cannot Identify the “Unknown Unknowns”

The information one gains from a simulation is necessarily limited by the information put into it. While simulations can be an important adjunct to testing when appropriately validated for the purpose for which they are used, no simulation can discover a system problem that arises from factors not included in the models on which the simulation is built. As an example, one system experienced unexpected reliability problems in field tests because soldiers were using an antenna as a handle, causing it to break. This kind of problem would rarely be discovered by means of a simulation.

As another example, consider Table 5-1, which shows the mean time between operational mission failures for the command launch unit of the Javelin (a man-portable anti-tank missile) in several testing situations.

TABLE 5-1 Reliability Assessment of Command Launch Unit of Javelin in Several Testing Situations

Test     Troop Handling   Mean Time Between Operational Mission Failures (hours)
RQT I    None             63
RQT II   None             482
RDGT     None             189
PPQT     None             89
DBT      Limited          78
FDTE     Limited          50
IOT      Typical          32

NOTE: RQT I, Reliability Qualification Test I; RQT II, Reliability Qualification Test II; RDGT, Reliability Development Growth Test; PPQT, Preproduction Qualification Test; DBT, Dirty Battlefield Test; FDTE, Force Development Test and Experimentation; IOT, Initial Operational Test.

Note that as troop handling grows to become typical of use in the field, the mean time between operational mission failures decreases. It is therefore reasonable to assume that the failure modes differ across the various test situations (granting that some were removed during the development process).
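MTBF point estimates like those in Table 5-1 are more informative when accompanied by interval estimates. The sketch below computes a two-sided confidence interval for MTBF under the strong, and here purely illustrative, assumption of exponentially distributed times between failures, using the Wilson-Hilferty approximation to chi-square quantiles; the figures of 320 operating hours and 10 failures are hypothetical, chosen only to match the IOT point estimate of 32 hours.

```python
import math
from statistics import NormalDist

def chi2_quantile(p, k):
    """Wilson-Hilferty approximation to the chi-square quantile with
    k degrees of freedom (accurate to a few percent for moderate k)."""
    z = NormalDist().inv_cdf(p)
    return k * (1 - 2 / (9 * k) + z * math.sqrt(2 / (9 * k))) ** 3

def mtbf_interval(total_hours, failures, conf=0.95):
    """Two-sided confidence interval for MTBF, assuming exponentially
    distributed times between failures (a strong assumption)."""
    a = (1 - conf) / 2
    lower = 2 * total_hours / chi2_quantile(1 - a, 2 * failures)
    upper = 2 * total_hours / chi2_quantile(a, 2 * failures)
    return lower, upper

# Hypothetical operational test: 320 hours of operation with 10 mission
# failures gives a point estimate of 32 hours, as in the IOT row.
lo, hi = mtbf_interval(320, 10)
print(round(320 / 10), round(lo, 1), round(hi, 1))
```

The width of such an interval makes plain how imprecise MTBF estimates from small operational tests are, which bears directly on how much weight table entries like these can carry.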
However, since a simulation designed to incorporate reliability would most likely include failures typical of developmental testing (rather than operational testing), such a simulation could never replace operational tests. Thus the challenge is to identify the most appropriate ways simulation can be used in concert with field tests in an overall cost-effective approach to testing.

FUTURE WORK

The panel is concerned that (1) rigorous validation of models and simulations for operational testing is infrequent, and external validation is at times used to overfit a model to field experience; (2) there is
little evidence seen for the use of statistical methods to help interpret results from simulations; (3) the literature on the use of simulations is deficient in its statistical content; and (4) simulations cannot identify the “unknown unknowns.”

A number of positions can be taken on the question of the use of simulation for operational testing. These range from (1) simulation is the future of operational testing; (2) simulation, when properly validated, can play a role in assessing effectiveness, but not suitability; (3) simulation can be useful in helping to identify the scenarios most important to test; (4) simulation can be useful in planning operational tests only with respect to effectiveness; to (5) simulation in its current state is relatively useless in operational testing. The panel is not yet ready to express its position on this question.

Simulations typically do not consider reliability, availability, and maintainability issues; do not control carefully for human factors; and are not “consolidative models” (Bankes, 1993), that is, do not consolidate known facts about a system and cannot, for the purpose at hand, safely be used as surrogates for the system itself. Simulations often are not entirely based on physical, engineering, or chemical models of the system or system components. Clearly the absence of these features reduces the utility of simulations for operational testing. This raises questions: At what level should a simulation focus in order to replicate, as well as possible, the operational behavior of a system? Should the simulation model the entire system or individual components? The panel is more optimistic about simulation of system components than of entire systems.

As our work goes forward, we need to expand our understanding of current practice in the Navy and Air Force, especially with respect to their validation of simulations.
We will also examine the use of distributed interactive simulation in operational testing to determine the particular applications of statistics in that area. Key issues requiring further investigation because of their complexity include the proper role of simulation in operational testing, the combination of information from field tests with results from simulations, and the proper use of probabilistic and statistical methodology in simulation. To accomplish the above, we intend to meet with experts on simulation from the Navy Operational Test and Evaluation Force and the Air Force Operational Test and Evaluation Center to learn their current procedures, and to examine the procedures used to validate simulations used or proposed for use in operational testing. We will also meet with simulation experts from the Institute for Defense Analyses and DoD to solicit their views on the proper role of simulation in operational testing.