| Copyright © 2009. National Academy of Sciences. All rights reserved. Terms of Use and Privacy Statement |
6
Assessing the IBCT/Stryker Operational
Test in a Broad Context
In our work reported here, the Panel on the Operational Test Design
and Evaluation of the Interim Armored Vehicle has used the
report of the Panel on Statistical Methods for Testing and Evaluating
Defense Systems (National Research Council, 1998a, referred to in this
chapter as NRC 1998) to guide our thinking about evaluating the IBCT/
Stryker Initial Operational Test (IOT). Consistent with our charge, we
view our work as a case study of how the principles and practices put for-
ward by the previous panel apply to the operational test design and evalua-
tion of IBCT/Stryker. In this context, we have examined the measures,
design, and evaluation strategy of IBCT/Stryker in light of the conclusions
and recommendations put forward in NRC 1998 with the goal of deriving
more general findings of broad applicability in the defense test and
evaluation community.
From a practical point of view, it is clear that several of the ideas put
forward in NRC 1998 for improvement of the measures and test design
cannot be implemented in the IBCT/Stryker IOT due to various constraints,
especially time limitations. However, by viewing the Stryker test as
an opportunity to gain additional insights into how to do good opera-
tional test design and evaluation, our panel hopes to further sharpen and
disseminate the ideas contained in NRC 1998. In addition, this perspec-
tive will demonstrate that nearly all of the recommendations contained in
this report are based on generally accepted principles of test design and
evaluation.
IMPROVED OPERATIONAL TESTING AND EVALUATION
Although we note that many of the recommendations contained in
NRC 1998 have not been fully acted on by ATEC or by the broader de-
fense test and evaluation community, this is not meant as criticism. The
paradigm shift called for in that report could not have been implemented
in the short amount of time since it has been available. Instead, our aim is
to more clearly communicate the principles and practices contained in NRC
1998 to the broad defense acquisition community, so that the changes sug-
gested will be more widely understood and adopted.
A RECOMMENDED PARADIGM FOR
TESTING AND EVALUATION
Operational tests, by necessity, are often large, very complicated, and
expensive undertakings. The primary contribution of an operational test
to the accumulated evidence about a defense system's operational suitability
and effectiveness that exists a priori is that it is the only objective assessment
of the interaction between the soldier and the complete system as it
will be used in the field. It is well known that a number of failure modes
and other considerations that affect a system's performance are best (or
even uniquely) exhibited under these conditions. For this reason, Conclusion
2.3 of NRC 1998 states: "operational testing is essential for defense
system evaluation."
Operational tests have been put forward as tests that can, in isolation
from other sources of information, provide confirmatory statistical "proof"
that specific operational requirements have been met. However, a major
finding of NRC 1998 is that, given the test size that is typical of the opera-
tional tests of large Acquisition Category I (ACAT I) systems and the het-
erogeneity of the performance of these systems across environments of use,
users, tactics, and doctrine, operational tests cannot, generally speaking,
satisfy this role.1 Instead, the test and evaluation process should be viewed
as a continuous process of information collection, analysis, and decision
making, starting with information collected from field experience of the
baseline and similar systems, and systems with similar or identical components,
through contractor testing of the system in question, and then
through developmental testing and operational testing (and in some sense
continued after fielding forward to field performance).
1 Conclusion 2.2 of the NRC 1998 report states: "The operational test and evaluation
requirement, stated in law, that the Director, Operational Test and Evaluation certify that a
system is operationally effective and suitable often cannot be supported solely by the use of
standard statistical measures of confidence for complex defense systems with reasonable
amounts of testing resources" (p. 33).
Perhaps the most widely used statistical method for supporting deci-
sions made from operational test results is significance testing. Significance
testing is flawed in this application because of inadequate test sample size to
detect differences of practical importance (see NRC, 1998:88-91), and
because it focuses attention inappropriately on a pass/fail decision rather
than on learning about the system's performance in a variety of settings.
Also, significance testing answers the wrong question (not whether the
system's performance satisfies its requirements, but whether the system's
performance is inconsistent with failure to meet its requirements), and
significance testing fails to balance the risk of accepting a "bad" system against the
risk of rejecting a "good" system. Significance tests are designed to detect
statistically significant differences from requirements, but they do not ad-
dress whether any differences that may be detected are practically signifi-
cant.
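The sample-size concern can be made concrete with a small calculation. The sketch below is illustrative only: it assumes a pass/fail measure scored per test replication, and the requirement (0.70), the true success rate (0.80), and the replication counts are hypothetical values chosen for the example, not figures from the IBCT/Stryker test. It computes the power of a one-sided exact binomial test, that is, the chance that the test would demonstrate a real improvement over the requirement.

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def exact_power(n, p0, p1, alpha=0.05):
    """Power of the one-sided exact binomial test of H0: p <= p0
    when the true success rate is p1 (hypothetical inputs)."""
    # Find the smallest critical count whose tail probability under H0
    # does not exceed alpha; k = n + 1 always qualifies (empty tail).
    for k in range(n + 2):
        if binom_tail(n, k, p0) <= alpha:
            return binom_tail(n, k, p1)

# Hypothetical requirement of 0.70 and true performance of 0.80.
power_small_test = exact_power(20, 0.70, 0.80)
power_large_test = exact_power(200, 0.70, 0.80)
print(f"power with 20 replications:  {power_small_test:.2f}")
print(f"power with 200 replications: {power_large_test:.2f}")
```

Under these assumed values, a test with a few dozen replications would usually fail to detect a difference of this size, which is the sense in which a nonsignificant result says little about whether the system meets its requirement.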
The DoD milestone process must be rethought, in order to replace the
fundamental role that significance testing currently plays in the pass/fail
decision with a fuller exploration of the consequences of the various pos-
sible decisions. Significance tests and confidence intervals2 provide useful
information, but they should be augmented by other numeric and analytic
assessments using all information available, especially from other tests and
trials. An effective formal decision-making framework could use, for ex-
ample, significance testing augmented by assessments of the likelihood of
various hypotheses about the performance of the system under test (and
the baseline system), as well as the costs of making various decisions based
on whether the various alternatives are true. Moreover, designs used in
operational testing are not usually constructed to inform the actual deci-
sions that operational test is intended to support. For example, if a new
system is supposed to outperform a baseline in specific types of environ-
ments, the test should provide sufficient test sample in those environments
to determine whether the advantages have been realized, if necessary at the
cost of test sample in environments where the system is only supposed to
equal the baseline.
2 Producing confidence intervals for sophisticated estimates often requires resampling
methods.
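One way to picture a decision framework of the kind described above is an expected-loss calculation over competing hypotheses about system performance. The following is a minimal sketch under stated assumptions, not a method prescribed by NRC 1998: the two hypotheses, the prior weights (standing in for evidence from earlier tests and trials), the test outcome of 8 successes in 10 replications, and the loss table are all hypothetical.

```python
from math import comb

def binomial_likelihood(successes, n, p):
    """Probability of the observed test result if the true success rate is p."""
    return comb(n, successes) * p**successes * (1 - p) ** (n - successes)

def posterior(prior, likelihood):
    """Bayes update over a discrete set of hypotheses."""
    joint = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(joint.values())
    return {h: v / total for h, v in joint.items()}

def expected_loss(post, loss):
    """Expected loss of each decision, averaged over the hypotheses."""
    return {d: sum(post[h] * loss[d][h] for h in post) for d in loss}

# Hypothetical success rates under each hypothesis about the system.
rates = {"meets": 0.85, "falls_short": 0.70}
# Hypothetical prior weights summarizing earlier tests and trials.
prior = {"meets": 0.6, "falls_short": 0.4}
# Hypothetical operational test result: 8 successes in 10 replications.
likelihood = {h: binomial_likelihood(8, 10, p) for h, p in rates.items()}
post = posterior(prior, likelihood)

# Hypothetical costs (arbitrary units) of each decision under each hypothesis.
loss = {
    "accept": {"meets": 0.0, "falls_short": 10.0},
    "reject": {"meets": 4.0, "falls_short": 0.0},
}
losses = expected_loss(post, loss)
best = min(losses, key=losses.get)
print(post, losses, best)
```

The point of the sketch is the structure: the decision depends jointly on prior information, the test data, and the relative costs of the two kinds of error, none of which a significance test alone takes into account.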
Testing the IBCT/Stryker is even more complicated than many ACAT
I systems in that it is really a test of a system of systems, not simply a test of
what Stryker itself is capable of. It is therefore no surprise that the size of
the operational test (i.e., the number of test replications) for IBCT/Stryker
will be inadequate to support many significance tests that could be used to
base decisions on whether Stryker should be passed to full-rate production.
Such decisions therefore need to be supplemented with information from
the other sources mentioned above.
This argument about the role of significance testing is even more im-
portant for systems such as the Stryker that are placed into operational
testing when the system's performance (much less its physical characteris-
tics) has not matured, since then the test size needs to be larger to achieve
reasonable power levels. When a fully mature system is placed into opera-
tional testing, the test is more of a confirmatory exercise, a shakedown test,
since it is essentially understood that the requirements are very likely to be
met, and the test can then focus on achieving a greater understanding of
how the system performs in various environments.
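The link between maturity and test size can be illustrated with a standard normal-approximation sample-size formula for a one-sided test of a proportion. This is a sketch with hypothetical numbers: a requirement of 0.70, a mature system assumed to perform at 0.85, and an immature one assumed to perform at 0.75. The closer true performance sits to the requirement, the more replications are needed to reach the same power.

```python
from math import ceil, sqrt
from statistics import NormalDist

def replications_needed(p0, p1, alpha=0.05, power=0.80):
    """Approximate number of replications for a one-sided test of
    H0: p <= p0 to detect a true success rate p1 > p0 with the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    numerator = z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))
    return ceil((numerator / (p1 - p0)) ** 2)

# Hypothetical margins above a requirement of 0.70.
mature = replications_needed(0.70, 0.85)    # comfortable margin
immature = replications_needed(0.70, 0.75)  # narrow margin
print(mature, immature)
```

With these assumed values the narrow-margin case calls for roughly ten times as many replications as the comfortable-margin case.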
Recommendation 3.3 of NRC 1998 argued strongly that information
should be used and appropriately combined from all phases of system de-
velopment and testing, and that this information needs to be properly
archived to facilitate retrieval and use. In the case of the IBCT/Stryker
IOT, it is clear that this has not occurred, as evidenced by the difficulty
ATEC has had in accessing relevant information from contractor testing
and, indeed, operational experiences from allies using predecessor systems
(e.g., the Canadian LAV-III).
HOW IBCT/STRYKER IOT CONFORMS WITH
RECOMMENDATIONS FROM THE NRC 1998 REPORT
Preliminaries to Testing
The new paradigm articulated in NRC 1998 argues that defense sys-
tems should not enter into operational testing unless the system design is
relatively mature. This maturation should be expedited through previous
testing that incorporates various aspects of operational realism in addition
to the usual developmental testing. The role, then, for operational testing
would be to confirm the results from this earlier testing and to learn more
about how to operate the system in different environments and what the
system's limitations are. The panel believes that in some important respects
Stryker is not likely to be fully ready for operational testing when that is
scheduled to begin. This is because:
1. many of the vehicle types have not yet been examined for their
suitability, having been driven only a fraction of the required mean miles to
failure (1,000 miles);
2. the use of add-on armor has not been adequately tested prior to the
operational test;
3. it is still not clear how IBCT/Stryker needs to be used in various
types of scenarios, given the incomplete development of its tactics and doc-
trine; and
4. the GFE systems providing situation awareness have not been suffi-
ciently tested to guarantee that the software has adequate reliability.
The role of operational test as a confirmatory exercise has therefore not
been realized for IBCT/Stryker. This does not necessarily mean that the
IOT should be postponed, since the decision to go to operational test is
based on a number of additional considerations. However, it does mean
that the operational test is being run with some additional complications
that could reduce its effectiveness.
Besides system maturity, another prerequisite for an operational test is
a full understanding of the factors that affect system performance. While
ATEC clearly understands the most crucial factors that will contribute to
variation in system performance (intensity, urban/rural, day/night, terrain,
and mission type), it is not clear whether they have carried out a systematic
test planning exercise, including (quoting from NRC, 1998a:64-65): "(1)
defining the purpose of the test; . . . (4) using previous information to
compare variation within and across environments, and to understand sys-
tem performance as a function of test factors; . . . and (6) use of small-scale
screening or guiding tests for collecting information on test planning."
Also, as mentioned in Chapter 4, it is not yet clear that the test design
and the subsequent test analysis have been linked. For example, if perfor-
mance in a specific environment is key to the evaluation of IBCT/Stryker,
more test replications will need to be allocated to that environment. In
addition, while the main factors affecting performance have been identi-
fied, factors such as season, day versus night, and learning effects were not,
at least initially, explicitly controlled for. This issue was raised in the panel's
letter report (Appendix A).
Test Design
This section discusses two issues relevant to test design: the basic test
design and the allocation of test replications to design cells. First, ATEC
has decided to use a balanced design to give it the most flexibility in esti-
mating the variety of main effects of interest. As a result, the effects of
terrain, intensity, mission, and scenario on the performance of these sys-
tems will be jointly estimated quite well, given the test sample size. How-
ever, at this point in system development, ATEC does not appear to know
which of these factors matter more and/or less, or where the major uncer-
tainties lie. Thus, it may be that there is only a minority of environments
in which IBCT/Stryker offers distinct advantages, in which case those en-
vironments could be more thoroughly tested to achieve a better under-
standing of its advantages in those situations. Specific questions of inter-
est, such as the value of situation awareness in explaining the advantage of
IBCT/Stryker, can be addressed by designing and running small side ex-
periments (which might also be addressed prior to a final operational test).
This last suggestion is based on Recommendation 3.4 of the NRC 1998
report (p. 49): "All services should explore the adoption of the use of small-scale
testing similar to the Army concept of force development test and
experimentation."
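The property that makes a balanced design attractive for estimating main effects can be shown in a few lines. The sketch below is illustrative: the factor names echo those mentioned in this chapter, but the levels and the number of replications per cell are hypothetical and do not describe the actual IBCT/Stryker design. It enumerates every combination of factor levels, assigns the same number of replications to each, and confirms that every level of a factor receives the same total.

```python
from itertools import product

def balanced_design(factors, reps_per_cell):
    """Assign the same number of replications to every combination of levels."""
    cells = product(*factors.values())
    return {cell: reps_per_cell for cell in cells}

def replications_by_level(design, factors):
    """Total replications observed at each level of each factor."""
    totals = {name: {level: 0 for level in levels} for name, levels in factors.items()}
    for cell, reps in design.items():
        for name, level in zip(factors, cell):
            totals[name][level] += reps
    return totals

# Hypothetical levels for illustration only.
factors = {
    "terrain": ["urban", "rural"],
    "time": ["day", "night"],
    "intensity": ["low", "medium", "high"],
}
design = balanced_design(factors, reps_per_cell=1)
totals = replications_by_level(design, factors)
print(len(design), totals)
```

A design focused on particular environments would break this equality on purpose, trading precision on some main effects for more information where the system's advantages are expected to lie.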
Modeling and simulation are discussed in NRC 1998 as an important
tool in test planning. ATEC should take better advantage of information
from modeling and simulation, as well as from developmental testing, that
could be very useful for the IBCT/Stryker test planning. This includes
information as to when the benefits of the IBCT/Stryker over the baseline
are likely to be important but not well established.
Finally, in designing a test, the goals of the test have to be kept in
mind. If the goal of an operational test is to learn about system capabilities,
then test replications should be focused on those environments in which
the most can be learned about how the system's capabilities provide advan-
tages. For example, if IBCT/Stryker is intended primarily as an urban
system, more replications should be allocated to urban environments than
to rural ones. We understand ATEC's view that its operational test designs
must allocate, to the extent possible, replications to environments in accor-
dance with the allocation of expected field use, as presented in the OMS/
MP. In our judgment the OMS/MP need only refer to the operational
evaluation, and certainly once estimates of test performance in each envi-
ronment are derived, they can be reweighted to correspond to summary
measures defined by the OMS/MP (which may still be criticized for focus-
ing too much on such summary measures in comparison to more detailed
assessments).
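The reweighting described above is a stratified estimate: performance is estimated separately in each environment, and the estimates are then combined using the mission-profile shares as weights, whatever the allocation of replications was. The sketch below assumes a success-rate measure; the environments, replication counts, observed rates, and weights are hypothetical and are not taken from the OMS/MP.

```python
from math import sqrt

def reweighted_rate(rates, weights):
    """Combine per-environment success rates using mission-profile weights."""
    total = sum(weights.values())
    return sum(rates[env] * weights[env] / total for env in rates)

def reweighted_std_error(rates, replications, weights):
    """Standard error of the reweighted rate, treating environments as
    independent and each rate as a binomial proportion."""
    total = sum(weights.values())
    variance = sum(
        (weights[env] / total) ** 2 * rates[env] * (1 - rates[env]) / replications[env]
        for env in rates
    )
    return sqrt(variance)

# Hypothetical test: replications concentrated in the urban environment.
replications = {"urban": 12, "rural": 4}
rates = {"urban": 0.80, "rural": 0.60}
# Hypothetical shares of expected field use.
weights = {"urban": 0.4, "rural": 0.6}

summary = reweighted_rate(rates, weights)
std_error = reweighted_std_error(rates, replications, weights)
print(f"{summary:.2f} +/- {std_error:.2f}")
```

The summary measure remains available after an unequal allocation, although environments that receive few replications and carry large weights will dominate its uncertainty.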
Furthermore, there are substantial advantages obtained with respect to
designing operational tests by separating the two goals of confirming that
various requirements have been met and of learning as much as possible
about the capabilities and possible deficiencies of the system before going
to full-rate production. That separation allows the designs for these two
separate tests to target these two distinct objectives.
Given the recent emphasis in DoD acquisition on spiral development,
it is interesting to speculate about how staged testing might be incorpo-
rated into this management concept. One possibility is a test strategy in
which the learning phase makes use of early prototypes of the subsequent
stage of development.
System Suitability
Recommendation 7.1 of NRC 1998 states (p. 105):
The Department of Defense and the military services should give increased
attention to their reliability, availability, and maintainability data collection
and analysis procedures because deficiencies continue to be responsible for
many of the current field problems and concerns about military readiness.
While criticizing developmental and operational test design as being
too focused on evaluation of system effectiveness at the expense of evalua-
tion of system suitability, this recommendation is not meant to suggest that
operational tests should be strongly geared toward estimation of system
suitability, since these large-scale exercises cannot be expected to run long
enough to estimate fatigue life, etc. However, developmental testing can
give measurement of system (operational) suitability a greater priority and
can be structured to provide its test events with greater operational realism.
Use of developmental test events with greater operational realism also
should facilitate development of models for combining information, the
topic of this panel's next report.
The NRC 1998 report also criticized the test and evaluation commu-
nity for relying too heavily on the assumption that the interarrival time for
initial failures follows an exponential distribution. The requirement for
Stryker of 1,000 mean miles between failures makes sense as a relevant
measure only if ATEC is relying on the assumption of exponentially dis-
tributed times to failure. Given that Stryker, being essentially a mechanical
system, will not have exponentially distributed times to failure, due to
wearout, the actual distribution of waiting times to failure needs to be esti-
mated and presented to decision makers so that they understand its range
of performance. Along the same lines, Stryker will, in all probability, be
repaired during the operational test and returned to action. Understanding
the variation in suitability between a repaired and a new system should be
an important part of the operational test.
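To see why a mean alone can mislead, compare two failure-time models that share the same 1,000-mile mean. The sketch below uses a Weibull distribution with shape 2 as a stand-in for wearout; that shape value is an assumption made for illustration, not an estimate for Stryker. The exponential model has a constant failure rate, while the Weibull model has a failure rate that rises with accumulated miles.

```python
from math import exp, gamma

MEAN_MILES = 1000.0  # requirement cited in the text

def exponential_survival(miles, mean_miles):
    """P(no failure before `miles`) under exponential times to failure."""
    return exp(-miles / mean_miles)

def weibull_scale(mean_miles, shape):
    """Scale parameter giving a Weibull distribution the requested mean."""
    return mean_miles / gamma(1 + 1 / shape)

def weibull_survival(miles, mean_miles, shape):
    """P(no failure before `miles`) under Weibull times to failure."""
    scale = weibull_scale(mean_miles, shape)
    return exp(-((miles / scale) ** shape))

def weibull_hazard(miles, mean_miles, shape):
    """Instantaneous failure rate per mile at a given mileage."""
    scale = weibull_scale(mean_miles, shape)
    return (shape / scale) * (miles / scale) ** (shape - 1)

SHAPE = 2.0  # hypothetical wearout shape (shape > 1 means rising failure rate)
for miles in (250, 500, 1000, 1500):
    print(
        miles,
        round(exponential_survival(miles, MEAN_MILES), 3),
        round(weibull_survival(miles, MEAN_MILES, SHAPE), 3),
        round(weibull_hazard(miles, MEAN_MILES, SHAPE), 5),
    )
```

Both models satisfy a 1,000-mile mean, yet they imply different chances of completing a mission of a given length, which is why the full distribution, and not only its mean, is worth presenting to decision makers.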
Testing of Software-Intensive Systems
The panel has been told that obtaining information about the performance
of GFE is not a priority of the IOT: GFE will be assumed to have
well-estimated performance parameters, so the test should focus on the
non-GFE components of Stryker. One of the components of Stryker's
GFE is the software providing Stryker with situation awareness. A primary
assumption underlying the argument for the development of Stryker was
that the increased vulnerability of IBCT/Stryker (due to its reduced armor)
is offset by the benefits gained from the enhanced firepower and defensive
positions that Stryker will have due to its greater awareness of the place-
ment of friendly and enemy forces. There is some evidence (FBCB2 test
results) that this situation awareness capability is not fully mature at this
date. It would therefore not be surprising if newly developed, complex
software will suffer reliability or other performance problems that will not
be fully resolved prior to the start of operational testing.
NRC 1998 details procedures that need to be more widely adopted for
the development and testing of software-intensive systems, including us-
age-based testing. Further, Recommendation 8.4 of that report urges that
software failures in the field should be collected and analyzed. Making use
of the information on situation awareness collected during training exer-
cises and in contractor and developmental testing in the operational test
design would have helped in the more comprehensive assessment of the
performance of IBCT/Stryker. For example, allocating test replications to
situations in which previous difficulties in situation awareness had been
experienced would have been very informative as to whether the system is
effective enough to pass to full-rate production.
Greater Access to Statistical Expertise in
Operational Test and Evaluation
Stryker, if fully procured, will be a multibillion dollar system. Clearly,
the decision on whether to pass Stryker to full-rate production is extremely
important. Therefore, the operational test design and evaluation for Stryker
needs to be representative of the best possible current practice. The statisti-
cal resources allocated to this task were extremely limited. The enlistment
of the National Research Council for high-level review of the test design
and evaluation plans is commendable. However, this does not substitute
for detailed, hands-on, expert attention by a cadre of personnel trained in
statistics with "ownership" of the design and subsequent test and evalua-
tion. ATEC should give a high priority to developing a contractual rela-
tionship with leading practitioners in the fields of reliability estimation,
experimental design, and methods for combining information to help them
in future IOTs. (Chapter 10 of NRC 1998 discusses this issue.)
SUMMARY
The role of operational testing as a confirmatory exercise evaluating a
mature system design has not been realized for IBCT/Stryker. This does
not necessarily mean that the IOT should be postponed, since the decision
to go to operational testing is based on a number of additional consider-
ations. However, it does mean that the operational test is being asked to
provide more information than can be expected. The IOT may illuminate
potential problems with the IBCT and Stryker, but it may not be able to
convincingly demonstrate system effectiveness.
We understand ATEC's view that its operational test designs must allo-
cate, to the extent possible, replications to environments in accordance with
the allocation of expected field use, as presented in the OMS/MP. In the
panel's judgment, the OMS/MP need only refer to the operational evalua-
tion, and once estimates of test performance in each environment are de-
rived, they can be reweighted to correspond to summary measures defined
by the OMS/MP.
We call attention to a number of key points:
1. Operational tests should not be strongly geared toward estimation
of system suitability, since they cannot be expected to run long enough to
estimate fatigue life, estimate repair and replacement times, identify failure
modes, etc. Therefore, developmental testing should give greater priority
to measurement of system (operational) suitability and should be struc-
tured to provide its test events with greater operational realism.
2. Since the size of the operational test (i.e., the number of test replications)
for IBCT/Stryker will be inadequate to support significance tests
leading to a decision on whether Stryker should be passed to full-rate pro-
duction, ATEC should augment this decision by other numerical and
graphical assessments from this IOT and other tests and trials.
3. In general, complex systems should not be forwarded to operational
testing, absent strategic considerations, until the system design is relatively
mature. Forwarding an immature system to operational test is an expensive
way to discover errors that could have been detected in developmental test-
ing, and it reduces the ability of an operational test to carry out its proper
function. System maturation should be expedited through previous testing
that incorporates various aspects of operational realism in addition to the
usual developmental testing.
4. Because it is not yet clear that the test design and the subsequent
test analysis have been linked, ATEC should prepare a straw man test evalu-
ation report in advance of test design, as recommended in the panel's Octo-
ber 2002 letter to ATEC (see Appendix A).
5. The goals of the initial operational test need to be more clearly
specified. Two important types of goals for operational test are learning
about system performance and confirming system performance in com-
parison to requirements and in comparison to the performance of baseline
systems. These two different types of goals argue for different stages of
operational test. Furthermore, to improve test designs that address these
different types of goals, information from previous stages of system development
needs to be utilized.
6. To achieve needed detailed, hands-on, expert attention by a cadre of
statistically trained personnel with "ownership" of the design and subse-
quent test and evaluation, the Department of Defense and ATEC in par-
ticular should give a high priority to developing a contractual relationship
with leading practitioners in the fields of reliability estimation, experimen-
tal design, and methods for combining information to help them with fu-
ture IOTs.