Copyright © 2009. National Academy of Sciences. All rights reserved.
5
Data Analysis
The panel has noted (see the October 2002 letter report in Appen-
dix A) the importance of determining, prior to the collection of
data, the types of results expected and the data analyses that will be
carried out. This is necessary to ensure that the designed data collection
effort will provide enough information of the right types to allow for a
fruitful evaluation. Failure to think about the data analysis prior to data
collection may result in omitted explanatory or response variables or inad-
equate sample size to provide statistical support for important decisions.
Also, if the questions of interest are not identified in advance but in-
stead are determined by looking at the data, then it is not possible to for-
mally address the questions, using statistical arguments, until an indepen-
dent confirmatory study is carried out.
An important characteristic of the IBCT/Stryker IOT, probably in
common with other defense system evaluations, is that there are a large
number of measures collected during the evaluation. This includes mea-
sures of a variety of types (e.g., counts of events, proportions, binary out-
comes) related to a variety of subjects (e.g., mission performance, casual-
ties, reliability). In addition, there are a large number of questions of
interest. For the IBCT/Stryker IOT, these include: Does the Stryker-equipped
force outperform a baseline force? In which situations does the
Stryker-equipped force have the greatest advantage? Why does the
Stryker-equipped force outperform the baseline force? It is important to avoid
"rolling up" the many measures into a small number of summary measures
focused only on certain preidentified critical issues. Instead, appropriate
measures should be used to address each of the many possible questions. It
will sometimes, but certainly not always, be useful to combine measures
into an overall summary measure.
The design discussion in Chapter 4 introduced the important distinc-
tion between the learning phase of a study and the confirmatory phase of a
study. There we recommend that the study proceed in steps or stages rather
than as a single large evaluation. This section focuses on the analysis of the
data collected. The comments here are relevant whether a single evaluation
test is done (as proposed by ATEC) or a series of studies are carried out (as
proposed by the panel).
Another dichotomy that is relevant when analyzing data is that be-
tween the use of formal statistical methods (like significance tests) and the
use of exploratory methods (often graphical). Formal statistical tests and
procedures often play a large role in confirmatory studies (or in the confirmatory
phase described in Chapter 4). Less formal methods, known as
exploratory analysis, are useful for probing the data to detect interesting or
unanticipated data values or patterns. Exploratory analysis is used here in
the broad sense, to include but not to be limited by the methods described
in Tukey (1977). Exploratory methods often make extensive use of graphs
to search for patterns in the data. Exploratory analysis of data is always a
good thing, whether the data are collected as part of a confirmatory study
to compare two forces or as part of a learning phase study to ascertain the
limits of performance for a system.
The remainder of this chapter reviews the general principles behind
the formal statistical procedures used in confirmatory studies and those
methods used in exploratory statistical analyses and then presents some
specific recommendations for data analysis for the IBCT/Stryker IOT.
PRINCIPLES OF DATA ANALYSIS
Formal Statistical Methods in Confirmatory Analyses
A key component of any defense system evaluation is the formal com-
parison of the new system with an appropriately chosen baseline. It is usu-
ally assumed that the new system will outperform the baseline; hence this
portion of the analysis can be thought of as confirmatory. Depending on
the number of factors incorporated in the design, the statistical assessment
could be a two-sample comparison (if there are no other controlled experi-
mental or measured covariate factors) or a regression analysis (if there are
other factors). In either case, statistical significance tests or confidence in-
tervals are often used to determine if the observed improvement provided
by the new system is too large to have occurred by chance.
Statistical significance tests are commonly used in most scientific fields
as an objective method for assessing the evidence provided by a study. The
National Research Council (NRC) report Statistics, Testing, and Defense
Acquisition reviews the role and limitations of significance testing in defense
testing (National Research Council, 1998a). It is worthwhile to review
some of the issues raised in that report. One of the limitations of signifi-
cance testing is that it is focused on binary decisions: the null hypothesis
(which usually states that there is no difference between the experimental
and baseline systems) is rejected or not. If it is rejected, then the main goal
of the evaluation is achieved, and the data analysis may move to an explor-
atory phase to better understand when and why the new system is better. A
difficulty with the binary decision is that it obscures information about the
size of the improvement afforded by the new system, and it does not recog-
nize the difference between statistical significance and practical significance.
The outcome of a significance test is determined both by the amount of
improvement observed and by the sample size. Failure to find a statistically
significant difference may be because the observed improvement is less than
anticipated or because the sample size was not sufficient. Confidence inter-
vals that combine an estimate of the improvement provided by the new
system with an estimate of the uncertainty or variability associated with the
estimate generally provide more information. Confidence intervals provide
information about whether the hypothesis of "no difference" is plausible
given the data (as do significance tests) but also inform about the
likely size of the improvement provided by the system and its practical
significance. Thus confidence intervals should be used with or in place of
significance tests.
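To illustrate why an interval can be more informative than a reject/do-not-reject decision, the following minimal Python sketch computes an estimate and an approximate 95 percent confidence interval for a mean improvement. The function name `mean_diff_ci`, the difference scores, and the use of a normal approximation (rather than a t distribution) are assumptions made for illustration only; they are not taken from any ATEC plan.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_diff_ci(diffs, level=0.95):
    """Return (estimate, lower, upper) for the mean of paired differences.

    Uses a normal approximation; with small samples a t-based interval
    would be somewhat wider.
    """
    n = len(diffs)
    est = mean(diffs)
    se = stdev(diffs) / sqrt(n)                 # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    return est, est - z * se, est + z * se

# Hypothetical difference scores (new system minus baseline), illustration only.
small_study = [2, -1, 3, 0, 1, -2, 4, 1]
large_study = small_study * 5   # same pattern of scores, five times as many missions

for name, data in [("small", small_study), ("large", large_study)]:
    est, lo, hi = mean_diff_ci(data)
    verdict = "excludes 0" if lo > 0 or hi < 0 else "includes 0"
    print(f"{name}: estimate={est:.2f}, 95% CI=({lo:.2f}, {hi:.2f}), {verdict}")
```

Both hypothetical studies show the same average improvement, yet only the larger one yields an interval that excludes zero. Reporting the interval makes both the size of the estimated improvement and the role of sample size visible, which a bare significance decision does not.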
Other difficulties in using and interpreting the results of significance
tests are related to the fact that the two hypotheses are not treated equally.
Most significance test calculations are computed under the assumption that
the null hypothesis is correct. Tests are typically constructed so that a rejection
of the null hypothesis confirms the alternative that we believe (or hope)
to be true. The alternative hypothesis is used to suggest the nature of the
test and to define the region of values for which the null hypothesis is
rejected. Occasionally the alternative hypothesis also figures in statistical
power calculations to determine the minimum sample size required in or-
der to be able to detect differences of practical significance. Carrying out
tests in this way requires trading off the chances of making two possible
errors: rejecting the null hypothesis when it is true and failing to reject the
null hypothesis when it is false. Often in practice, little time is spent deter-
mining the relative cost of these two types of errors, and as a consequence
only the first is taken into account and reported.
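The trade-off between the two error types can be made concrete with a standard approximate sample size calculation. The sketch below assumes a two-sided test on paired differences with a known standard deviation and a normal approximation; the function name `required_n` and the numerical inputs are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def required_n(delta, sigma, alpha=0.05, power=0.80):
    """Approximate number of paired observations needed to detect a mean
    difference `delta` when the differences have standard deviation `sigma`.

    alpha is the chance of rejecting a true null hypothesis;
    1 - power is the chance of failing to reject a false one.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(((z_alpha + z_power) * sigma / delta) ** 2)

# Hypothetical inputs: detect a 1-point improvement when sigma is 2 points.
print(required_n(delta=1.0, sigma=2.0))              # 80 percent power
print(required_n(delta=1.0, sigma=2.0, power=0.90))  # 90 percent power
```

Lowering the acceptable rate of either error raises the required sample size, which is why the relative cost of the two errors deserves explicit attention before data collection.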
The large number of outcomes being assessed can further complicate
carrying out significance tests. Traditional significance tests often are de-
signed with a 5 or 10 percent error rate, so that significant differences are
declared to be in error only infrequently. However, this also means that if
formal comparisons are made for each of 20 or more outcome measures,
then the probability of an error in one or more of the decisions can become
quite high. Multiple comparison procedures allow for control of the experiment-wide
error rate by reducing the acceptable error rate for each individual
comparison. Because this makes the individual tests more conservative,
it is important to determine whether formal significance tests are
required for the many outcome measures. If we think of the analysis as
comprising a confirmatory and exploratory phase, then it should be pos-
sible to restrict significance testing to a small number of outcomes in the
confirmatory phase. The exploratory phase can focus on investigating the
scenarios for which improvement seems greatest using confidence intervals
and graphical techniques. In fact, we may know in advance that there are
some scenarios for which the IBCT/Stryker and baseline performance will
not differ, for example, in low-intensity military operations; it does not
make sense to carry out significance tests when we expect that the null
hypothesis is true or nearly true.
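The growth of the experiment-wide error rate can be checked with a short calculation. The sketch below assumes, for simplicity, that the tests are independent and that every null hypothesis is true; the Bonferroni adjustment shown is one common multiple comparison procedure, used here as an example rather than as the procedure ATEC would necessarily adopt.

```python
def familywise_error_rate(alpha, m):
    """Probability of at least one false rejection across m independent
    tests, each carried out at level alpha, when all nulls are true."""
    return 1 - (1 - alpha) ** m

def bonferroni_alpha(alpha, m):
    """Per-test level that keeps the experiment-wide rate at or below alpha."""
    return alpha / m

m = 20  # number of outcome measures tested
print(f"unadjusted: {familywise_error_rate(0.05, m):.3f}")
per_test = bonferroni_alpha(0.05, m)
print(f"per-test level: {per_test:.4f}")
print(f"adjusted: {familywise_error_rate(per_test, m):.3f}")
```

With 20 tests at the 5 percent level, the chance of at least one false rejection is roughly 64 percent under these assumptions, while the adjusted per-test level of 0.0025 shows how conservative each individual test must become.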
It is also clearly important to identify the proper unit of analysis in
carrying out statistical analyses. Often data are collected at several different
levels in a study. For example, one might collect data about individual
soldiers (especially casualty status), platoons, companies, etc. For many
outcome measures, the data about individual soldiers will not be indepen-
dent, because they share the same assignment. This has important implica-
tions for data analysis in that most statistical methods require independent
observations. This point is discussed in Chapter 4 in the context of study
design and is revisited below in discussing data analysis specifics for the
IBCT/Stryker JOT.
Exploratory Analyses
Conclusions obtained from the IOT should not stop with the confir-
mation that the new system performs better than the baseline. Operational
tests also provide an opportunity to learn about the operating characteris-
tics of new systems/forces. Exploratory analyses facilitate learning by mak-
ing use of graphical techniques to examine the large number of variables
and scenarios. For the IBCT/Stryker IOT, it is of interest to determine the
factors (mission intensity, environment, mission type, and force) that im-
pact IBCT/Stryker and the effects of these factors. Given the large number
of factors and the many outcome measures, the importance of the explor-
atory phase of the data analysis should not be underestimated.
In fact, it is not even correct to assume (as has been done in this chap-
ter) that formal confirmatory tests will be done prior to exploratory data
analysis. Examination of data, especially using graphs, can allow investiga-
tors to determine whether the assumptions required for formal statistical
procedures are satisfied and identify incorrect or suspect observations. This
ensures that appropriate methodology is used in the important confirma-
tory analyses. The remainder of this section assumes that this important
part of exploratory analysis has been carried out prior to the use of formal
statistical tests and procedures. The focus here is on another crucial use of
exploratory methods, namely, to identify data patterns that may suggest
previously unseen advantages or disadvantages for one force or the other.
Tukey (1977) and Chambers et al. (1983) describe an extensive collec-
tion of tools and examples for using graphical methods in exploratory data
analysis. These methods provide a mechanism for looking at the data to
identify interesting results and patterns that provide insight about the sys-
tem under study. Graphs displaying a single outcome measure against a
variety of factors can identify subsets of the design space (i.e., combinations
of factors) for which the improvement provided by a new system is notice-
ably high or low. Such graphs can also identify data collection or recording
errors and unexpected aspects of system performance.
Another type of graphical display presents several measures in a single
graph (for example, parallel box plots for the different measures or the same
measures for different groups). Such graphs can identify sets of outcome
measures that show the same pattern of responses to the factors, and so can
help confirm either that these measures are all correlated with mission suc-
cess as expected, or may identify new combinations of measures worthy of
consideration. When an exploratory analysis of many independent mea-
sures shows results consistent with a priori expectations but not statistically
significant, these results might in combination reinforce one another if they
could all be attributed to the same underlying cause.
It should be pointed out that exploratory analysis can include formal
multivariate statistical methods, such as principal components analysis, to
determine which measures appear to correlate highly across mission scenarios
(see, for example, Johnson and Wichern, 1992). One might identify
combinations of measures that appear to correlate well with the ratings
of SMEs, in this way providing a form of objective confirmation of the
implicit combination of information done by the experts.
Reliability and Maintainability
The general comments above regarding confirmatory and exploratory
analysis apply to all types of outcome measures, including those associated
with reliability and maintainability, although the actual statistical
techniques used may vary. For example, the use of exponential or Weibull
data models is common in reliability work, while normal data models are
often dominant in other fields. Meeker and Escobar (1998) provide an
excellent discussion of statistical methods for reliability.
A key aspect of complex systems like Stryker that impacts reliability,
availability, and maintainability data analysis is the large number of failure
modes that affect reliability and availability (discussed also in Chapter 3).
These failure modes can be expected to have different behavior. Failure
modes due to wear would have increasing hazard over time, whereas other
modes would have decreasing hazard over time (as defects are fixed). Rather
than using statistical models to directly model system-wide failures, each of
the major failure modes should be modeled. Inferences about system-wide
reliability would then be obtained by combining information from the dif-
ferent modes.
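A minimal sketch of this mode-by-mode approach follows. It assumes each failure mode follows a Weibull model and that the modes act independently, so that the system survives only if no mode has failed; the mode names and parameter values are hypothetical and chosen only to show one increasing and one decreasing hazard.

```python
from math import exp

def weibull_survival(t, shape, scale):
    """Probability that a given failure mode has not occurred by time t."""
    return exp(-((t / scale) ** shape))

def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate at time t (t > 0).
    shape > 1: increasing hazard (wear); shape < 1: decreasing hazard."""
    return (shape / scale) * (t / scale) ** (shape - 1)

# Hypothetical failure modes: (name, shape, scale in operating hours).
modes = [("wear-out", 2.5, 800.0), ("early defect", 0.7, 5000.0)]

def system_survival(t, modes):
    """System-wide reliability, assuming independent failure modes:
    the product of the per-mode survival probabilities."""
    reliability = 1.0
    for _name, shape, scale in modes:
        reliability *= weibull_survival(t, shape, scale)
    return reliability

for hours in (100, 500, 1000):
    print(hours, round(system_survival(hours, modes), 3))
```

Fitting a single model to the pooled failures would blur the two hazard shapes together, whereas the per-mode models keep the wear-out and early-defect behavior separately interpretable.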
Thinking about exploratory analysis for reliability and maintainability
data raises important issues about data collection. Data regarding the reli-
ability of a vehicle or system should be collected from the start of opera-
tions and tracked through the lifetime of the vehicle, including training
uses of the vehicle, operational tests, and ultimately operational use. It is a
challenge to collect data in this way and maintain it in a common database,
but the ability to do so has important ramifications for reliability modeling.
It is also important to keep maintenance records as well, so that the times
between maintenance and failures are available.
Modeling and Simulation
Evaluation plans often rely on modeling and simulation to address
several aspects of the system being evaluated. Data from the operational
test may be needed to run the simulation models that address some issues,
but certainly not all; for example, no new data are needed for studying
transportability of the system. Information from an operational test may
also identify an issue that was not anticipated in pretest simulation work,
and this could then be used to refine or improve the simulation models.
In addition, modeling and simulation can be used to better under-
stand operational test results and to extrapolate to larger units. This is done
by using data from the operational test to recreate and/or visualize test
events. The recreated events may then be further probed via simulation. In
addition, data (e.g., on the distributions of events) can be used to run
through simulation programs and assess factors likely to be important at
the brigade level. Care should be taken to assess how the uncertainty arising from
the limited sample size of the IOT affects the results of the scaled-up simulations.
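One possible way to examine this uncertainty is to resample the small set of test results and rerun the scaled-up calculation on each resample. The sketch below uses a simple bootstrap for that purpose. The function `scaled_up_outcome` is a purely hypothetical stand-in for a brigade-level simulation, and the event rates and the figure of nine companies are made-up values for illustration.

```python
import random
from statistics import mean

def scaled_up_outcome(event_rate, n_companies=9):
    """Hypothetical stand-in for a brigade-level simulation (illustration only)."""
    return n_companies * event_rate

def bootstrap_interval(sample, n_boot=2000, level=0.90, seed=0):
    """Percentile bootstrap interval for the scaled-up outcome."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    outcomes = []
    for _ in range(n_boot):
        resample = [rng.choice(sample) for _ in sample]
        outcomes.append(scaled_up_outcome(mean(resample)))
    outcomes.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return outcomes[lo_idx], outcomes[hi_idx]

observed_rates = [0.8, 1.1, 0.5, 1.4, 0.9, 1.3]   # hypothetical per-mission event rates
point = scaled_up_outcome(mean(observed_rates))
low, high = bootstrap_interval(observed_rates)
print(f"point estimate {point:.1f}, 90% interval ({low:.1f}, {high:.1f})")
```

The width of the resulting interval gives a rough indication of how much of the spread in the scaled-up result comes from having only a handful of test missions.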
ANALYSIS OF DATA FROM THE IBCT/STRYKER IOT
This section addresses more specifically the analysis of data to be collected
from the IBCT/Stryker IOT. Comments here are based primarily
on information provided to the panel in various documents (see Chapter 1)
and briefings by ATEC that describe the test and evaluation plans for the
IBCT/Stryker.
Confirmatory Analysis
ATEC has provided us with detailed plans describing the intended
analysis of the SME scores of mission outcomes and mission casualty rates.
These plans are discussed here.
The discussion of general principles in the preceding section comments
on the importance of defining the appropriate unit for data analysis. The
ATEC-designed evaluation consists basically of 36 missions for the Stryker-
equipped force and 36 missions for the baseline force (and the 6 additional
missions in the ATEC design reserved for special studies). These missions
are defined by a mix of factors, including mission type (raid, perimeter
defense, area presence), mission intensity (high, medium, low), location
(rural, urban), and company pair (B, C). The planned analysis of SME
mission scores uses the mission as the basic unit. This seems reasonable,
although it may be possible to carry out some data analysis using company-
level or platoon-level data or using events within missions (as described in
Chapter 4). The planned analysis of casualty rates appears to work with the
individual soldier as the unit of analysis. In the panel's view this is incorrect
because there is sure to be dependence among the outcomes for different
soldiers. Therefore, a single casualty rate should be computed for each
mission (or for other units that might be deemed to yield independent
information) and these should be analyzed in the manner currently planned
for the SME scores.
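The change in unit of analysis amounts to collapsing soldier-level records into one rate per mission before any modeling. A minimal sketch, with hypothetical mission identifiers and records, might look like this:

```python
from collections import defaultdict

def mission_casualty_rates(records):
    """Collapse soldier-level records into one casualty rate per mission.

    records: iterable of (mission_id, is_casualty) pairs, one per soldier.
    """
    totals = defaultdict(lambda: [0, 0])    # mission_id -> [casualties, soldiers]
    for mission_id, is_casualty in records:
        totals[mission_id][0] += int(is_casualty)
        totals[mission_id][1] += 1
    return {m: c / n for m, (c, n) in totals.items()}

# Hypothetical soldier-level data: 10 soldiers across 3 missions.
records = [
    ("M1", True), ("M1", False), ("M1", False), ("M1", False),
    ("M2", False), ("M2", False),
    ("M3", True), ("M3", True), ("M3", False), ("M3", False),
]
rates = mission_casualty_rates(records)
print(rates)
print("units of analysis:", len(rates))
```

The subsequent analysis then rests on three observations (one per mission) rather than ten, which reflects the amount of independent information more honestly when soldiers on the same mission share the same circumstances.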
Several important data issues should be considered by ATEC analysts.
These are primarily related to the SME scores. Confirmatory analyses are
often based on the assumptions that there is a continuous or at least or-
dered categorical measurement scale (although they are often done with
Poisson or binomial data) and that the measurements on that scale are
subject to measurement error that has constant variance (independent of
the measured value). The SME scores provide an ordinal scale such that a
mission success score of 8 is better than a score of 7, which is better than a
score of 6. It is not clear that the scale can be considered an interval scale in
which the difference between an 8 and 7 and between a 7 and 6 are the
same. In fact, anecdotal evidence was presented to the panel suggesting
that scores 5 through 8 are viewed as successes, and scores 1 through 4 are
viewed as failures, which would imply a large gap between 4 and 5. One
might also expect differences in the level of variation observed at different
points along the scale, for two reasons. First, data values near either end of
a scale (e.g., 1 or 8 in the present case) tend to have less measurement
variation than those in the middle of the scale. One way to argue this is to
note that all observers are likely to agree on judgments of missions with
scores of 7 or 8, while there may be more variation on judgments about
missions in the middle of the scoring scale (one expert's 3 might be another's
5). Second, the missions are of differing length and complexity. It is quite
likely that the scores of longer missions may have more variability than
those of shorter missions. Casualty rates, as proportions, are also likely to
exhibit nonconstant variance. There is less variation in a low casualty rate
(or an extremely high one) and more variation for a casualty rate away from
the extremes. Transformations of SME scores or casualty rates should be
considered if nonconstant variance is determined to be a problem.
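As an illustration of such a transformation, the sketch below compares the variance of a raw proportion with that of its arcsine square-root transform at a low and a middling rate. It assumes a binomial model with 50 soldiers per mission, which is a simplification (casualties within a mission are unlikely to be independent), and the arcsine square-root transform is only one of several candidates.

```python
from math import asin, comb, sqrt

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def variance_of(transform, n, p):
    """Exact variance of transform(X / n) when X ~ Binomial(n, p)."""
    values = [transform(k / n) for k in range(n + 1)]
    probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
    center = sum(v * w for v, w in zip(values, probs))
    return sum(w * (v - center) ** 2 for v, w in zip(values, probs))

def identity(rate):
    return rate

def arcsine_sqrt(rate):
    return asin(sqrt(rate))

n = 50  # hypothetical number of soldiers per mission
for p in (0.1, 0.5):
    raw = variance_of(identity, n, p)
    transformed = variance_of(arcsine_sqrt, n, p)
    print(f"rate {p}: raw variance {raw:.5f}, transformed variance {transformed:.5f}")
```

On the raw scale the variance at a rate of 0.5 is nearly three times that at 0.1, whereas on the transformed scale the two variances are much closer to each other, which is the property a constant-variance analysis needs.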
The intended ATEC analysis focuses on the difference between IBCT/
Stryker force outcomes and baseline force outcomes for the 36 missions.
By working with differences, the main effects of the various factors are
eliminated, providing for more precise measurement of system effective-
ness. Note that variation due to interactions, that is to say variation in the
benefits provided by IBCT/Stryker over different scenarios, must be ad-
dressed through a statistical model. The appropriate analysis, which ap-
pears to be part of ATEC plans, is a linear model that relates the difference
scores (that is, the difference between the IBCT and baseline performance
measures on the same mission) to the effects of the various factors. The
estimated residual variance from such a model provides the best estimate of
the amount of variation in outcome that would be expected if missions
were repeated under the same conditions. This is not the same as simply
computing the variance of the 36 differences, as that variance would be
inflated by the degree to which the IBCT/Stryker advantage varies across
scenarios. The model would be likely to be of the form
D_i = difference score for mission i
    = overall mean + mission type effect + mission intensity effect
      + location effect + company effect + other desired interactions + error
The estimated overall mean is the average improvement afforded by
IBCT/Stryker relative to the baseline. The null hypothesis of no difference
(overall mean = 0) would be tested using traditional methods. Additional
parameters measure the degree to which IBCT/Stryker improvement varies
by mission type, mission intensity, location, company, etc. These addi-
tional parameters can be tested for significance or, as suggested above, esti-
mates for the various factor effects can be reported along with estimates of
their precision to aid in the judgment of practically significant results. This
same basic model can be applied to other continuous measures, including
casualty rate, subject to earlier concerns about homogeneity of variance.
This discussion ignores the six additional missions for each force.
These can also be included and would provide additional degrees of free-
dom and improved error variance estimates.
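The distinction between the residual variance from the model and the raw variance of the differences can be seen in a deliberately simplified case. The sketch below uses a single factor (mission intensity) in a balanced layout, so that fitting the model reduces to computing group means; the difference scores are hypothetical, and the full model described above would include the other factors and any desired interactions.

```python
from statistics import mean, variance

# Hypothetical difference scores (IBCT minus baseline) by mission intensity.
diffs = {
    "low":    [0.0, 0.5, -0.5, 0.0],
    "medium": [1.0, 1.5, 0.5, 1.0],
    "high":   [2.0, 2.5, 1.5, 2.0],
}

all_scores = [d for group in diffs.values() for d in group]
overall_mean = mean(all_scores)        # estimated average improvement
raw_variance = variance(all_scores)    # inflated by the intensity effect

# Residual sum of squares around each intensity level's fitted mean.
rss = sum((d - mean(group)) ** 2 for group in diffs.values() for d in group)
residual_df = len(all_scores) - len(diffs)
residual_variance = rss / residual_df  # estimate of mission-to-mission noise

print(f"overall mean improvement: {overall_mean:.2f}")
print(f"raw variance of differences: {raw_variance:.3f}")
print(f"residual variance from model: {residual_variance:.3f}")
```

In this made-up example the raw variance is several times the residual variance because the advantage grows with intensity; using the raw variance would overstate the noise and understate the precision of the estimated overall improvement.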
Exploratory Analysis
It is anticipated that IBCT/Stryker will outperform the baseline. As-
suming that result is obtained, the focus will shift to determining under
which scenarios Stryker helps most and why. This is likely to be determined
by careful analysis of the many measures and scenarios. In particular,
it seems valuable to examine the IBCT unit scores, baseline unit scores,
and differences graphically to identify any unusual values or scenarios.
Such graphical displays will complement the results of the confirmatory
analyses described above.
In addition, the exploratory analysis provides an opportunity to con-
sider the wide range of measures available. Thus, in addition to SME scores
of mission success, other measures (as described in Chapter 3) could be
used. By looking at graphs showing the relationship of mission outcome
and factors like intensity simultaneously for multiple outcomes, it should
be possible to learn more about IBCT/Stryker's strengths and vulnerabili-
ties. However, the real significance of any such insights would need to be
confirmed by additional testing.
Reliability and Maintainability
Reliability and maintainability analyses are likely to be focused on assessing
the degree to which Stryker meets the design specifications. Traditional
reliability methods will be useful in this regard. The general principle
discussed earlier concerning separate modeling for different failure
modes is important. It is also important to explore the reliability data
across vehicle types to identify groups of vehicles that may share common
reliability profiles or, conversely, those with unique reliability problems.
Modeling and Simulation
ATEC has provided little detail about how the IBCT/Stryker IOT data
might be used in post-IOT simulations, so we do not discuss this issue.
This leaves open the question of whether and how operational test data can
be extrapolated to yield information about larger scale operations.
SUMMARY
The IBCT/Stryker IOT is designed to serve two major purposes: (1)
confirming that the Stryker-equipped force will outperform the Light
Infantry Brigade baseline and estimating the amount by which it will
outperform, and (2) exploring the performance of the IBCT to learn about
the performance capabilities and limitations of Stryker. Statistical significance
tests are useful in the confirmatory analysis comparing the Stryker-equipped
and baseline forces. In general, however, the issues raised by the
1998 NRC panel suggest that more use should be made of estimates and
associated measures of precision (or confidence intervals) in addition to
significance tests because the former enable the judging of the practical
significance of observed effects. There is a great deal to be learned by
exploratory analysis of the IOT data, especially using graphical methods.
The data may instruct ATEC about the relative advantage of IBCT/Stryker
in different scenarios as well as any unusual events during the operational
test.
We call attention to several key issues:
1. The IBCT/Stryker IOT involves the collection of a large number
of measures intended to address a wide variety of issues. The measures
should be used to address relevant issues without being rolled up into over-
all summaries until necessary.
2. The statistical methods to be used by ATEC are designed for
independent study units. In particular, it is not appropriate to compare
casualty rates by simply aggregating indicators for each soldier over a set of
missions. Casualty rates should be calculated for each mission (or possibly
for discrete events of shorter duration) and these used in subsequent data
analyses.
3. The IOT provides little vehicle operating data and thus may not
be sufficient to address all of the reliability and maintainability concerns of
ATEC. This highlights the need for improved data collection regarding
vehicle usage. In particular, data should be maintained for each vehicle
over that vehicle's entire life, including training, testing, and ultimately
field use; data should also be gathered separately for different failure modes.
4. The panel reaffirms the recommendation of the 1998 NRC panel
that more use should be made of estimates and associated measures of pre-
cision (or confidence intervals) in addition to significance tests, because the
former enable the judging of the practical significance of observed effects.