a system that may still have problems, versus holding back a system that is performing acceptably.

  • Decision makers typically are concerned not just with aggregate measures, but also with the performance of systems in particular environments. While the panel found this concern reflected in the analyses we examined, the reporting of results for individual scenarios and prototypes was almost always informal. The panel was often told that statistics could not be applied to the problem of performance of individual scenarios because sample sizes were too small. Analysts were frequently unaware of formal statistical methods and modeling approaches for making effective use of limited sample sizes.

  • Information from operational tests is infrequently combined with information from developmental tests or from the test and field performance of related systems. The ability to combine information is hampered by institutional problems, by the lack of a process for archiving test data, and by the lack of standardized reporting procedures.

This chapter discusses problems with current procedures for analyzing and reporting operational test results and recommends alternative approaches that can improve the efficiency with which decision-relevant information is extracted from operational tests. The evaluation of the operational readiness of systems can be substantially improved by implementing these changes. In addition, money can be saved through more efficient use of limited test funds.


Significance testing has a number of advantages for presenting the results of operational tests and for deciding whether to pass (defense) systems to full-rate production.1 Significance testing is a long-standing method for assessing whether an estimated quantity is significantly different from an assumed quantity. Therefore, it has utility in evaluating whether the results of an operational test demonstrate the satisfaction of a system requirement. The objectivity of this approach is useful, given the various incentives of participants in the acquisition process. Certainly, if a system performs significantly below its required level, it is a major concern.
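To make the comparison described above concrete, the following is a minimal sketch of a one-sided, one-sample t-test of an observed mean against a required level. It is an illustration under stated assumptions, not a procedure taken from any test report: the function name, the ten observations, and the requirement of 100 hours are all hypothetical, and the critical value 1.833 is the standard one-sided 5 percent point of the t distribution with 9 degrees of freedom.

```python
import math
import statistics


def one_sample_t_statistic(observations, required_level):
    """Return the t statistic for the sample mean versus a required level."""
    n = len(observations)
    if n < 2:
        raise ValueError("at least two observations are needed")
    sample_mean = statistics.mean(observations)
    sample_sd = statistics.stdev(observations)  # n - 1 in the denominator
    if sample_sd == 0:
        raise ValueError("sample standard deviation is zero")
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - required_level) / standard_error


# Hypothetical data: hours between failures in ten test trials.
hypothetical_hours = [92.0, 105.0, 88.0, 97.0, 110.0,
                      85.0, 94.0, 101.0, 90.0, 98.0]
REQUIRED_LEVEL = 100.0    # hypothetical requirement, in hours
T_CRITICAL_DF9 = 1.833    # one-sided 5 percent critical value, 9 degrees of freedom

t_stat = one_sample_t_statistic(hypothetical_hours, REQUIRED_LEVEL)

# The shortfall is called significant only if t falls below -critical value.
significantly_below = t_stat < -T_CRITICAL_DF9

print(f"t statistic: {t_stat:.3f}")
print(f"significantly below requirement: {significantly_below}")
```

In this hypothetical case the sample mean falls below the required level, yet the shortfall is not statistically significant at the 5 percent level with only ten trials. A summary statistic alone would not convey that distinction.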


In several places in this report, a result from a significance (hypothesis) test, specifically a t-test, is put forward as an operational testing output of primary interest to decision makers for a given measure of performance or effectiveness. When such results are produced, that is the role they can play (besides often being used to determine a statistically based sample size in sample design). However, we point out that in many or most cases, summary statistics (such as means or percentages, especially when they exceed a required level) are viewed as sufficient for input to the decision process; use of significance testing is not customary. Even though significance testing is not uniformly applied, the panel views it as representing current best practice, and that is what we choose to comment on.

Copyright © National Academy of Sciences. All rights reserved.