Copyright © 2009. National Academy of Sciences. All rights reserved.
4
Data Analysis to Assess Performance and
to Support Software Improvement
The model-based testing schemes described above will produce a
collection of inputs to and outputs from a software system, the
inputs representing user stimuli and the outputs measures of the
functioning of the software. Data can be collected on a system either in
development or in use, and can then be analyzed to examine a number of
important aspects of software development and performance. It is impor-
tant to use these data to improve software engineering processes, to dis-
cover faults as early as possible in system development, and to monitor
system performance when fielded. The main aspects of software develop-
ment and performance examined at the workshop include: (1) measure-
ment of software risk, (2) measurement of software aging, (3) defect classi-
fication and analysis, and (4) use of Bayesian modeling for prediction of the
costs of software development. These analyses by no means represent all
the uses of test and performance data for a software system, but they do
provide an indication of the breadth of studies that can be carried out.
MEASURING SOFTWARE RISK
When acquiring a new software system, or comparing competing soft-
ware systems for acquisition, it is important to be able to estimate the risk
of software failure. In addressing risk, one assumes that associated with
each input i to the software system there is a cost resulting from the failure
of the software. To determine which inputs will and will not result in
DATA ANALYSIS
system failure, a set of test inputs is typically selected with a contractual
understanding to complete the entire test set (using some test selection
method) and the software is then run using that set. If, for various reasons,
a test regimen ends up incomplete, this incompleteness needs to be ac-
counted for to provide an assessment of the risk of failure for the software.
The interaction of the test selection method, the sometimes incomplete
process of testing for defects, the probability of selection of inputs by users,
and the fact that certain software failures are more costly than others all
raise some interesting measurement issues, which were addressed at the
workshop by Elaine Weyuker of AT&T.
To begin, one can assume either that there is a positive cost of failure,
denoted cost(i), associated with every input i, or that there is a cost c(i) that
is positive only if that input actually results in a system failure, with the cost
c(i) being set equal to zero otherwise. (In other words, cost(i) measures the
potential consequences of various types of system failure, regardless of
whether the system would actually fail, and c(i) is a realized cost that is zero
if the system works with input i.) A further assumption is that one can
estimate the probability that various test inputs occur in field use; these
probabilities are referred to collectively as the operational distribution.
Assume also that a test case selection method has been chosen to esti-
mate system risk. This can be done, as discussed below, in a number of
different ways. The selection of a test suite can be thought of as a contrac-
tual agreement that each input contained in the suite must be tried out on
the software. A reasonable and conservative assumption is that any test
cases not run are assumed to have failed had they been applied. This as-
sumption is adopted to counter the worry that one could bias a risk assess-
ment by not running cases that one thought a priori might fail. Any other
missing information is assumed to be replaced by the most conservative
possible value to provide an upper bound on the estimation of risk. In this
way, any cases that should have been run, but weren't, are accounted for
either as if they have been run and failed or as if the worst possible case has
been run and failed, depending on the contract. The cost of running a
software program on test case i is therefore defined to be c(i) if the program
is run on i, and is defined to be cost(i) otherwise, using this conservative
assumption. The overall estimated risk associated with a software program,
based on testing using some test suite, is defined to be the weighted sum
over test cases of the cost of failure (c(i) or cost(i)) for test
input i multiplied by the (normalized) probability that test input i would
INNOVATIONS IN SOFTWARE ENGINEERING
occur in the field (normalized over those inputs actually contained in the
test suite when appropriate, given the contract).
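The definition above can be written compactly. The following is only a sketch: the symbols T (the contracted test suite), R (the subset of T actually run), and k(i) (the cost charged to test case i) are notation introduced here for illustration. Pr(i) is the operational probability of input i, cost(i) is the potential cost of failure on input i, and c(i) is the realized cost, which is zero when the run succeeds.

```latex
\[
  \mathrm{Risk}
  \;=\;
  \sum_{i \in T} \frac{\Pr(i)}{\sum_{j \in T} \Pr(j)} \, k(i),
  \qquad
  k(i) =
  \begin{cases}
    c(i)             & \text{if } i \in R \quad \text{(case was run)},\\[2pt]
    \mathrm{cost}(i) & \text{if } i \in T \setminus R \quad \text{(case not run, treated as failed)}.
  \end{cases}
\]
```

The normalization over T reflects the phrase "normalized over those inputs actually contained in the test suite"; whether it applies depends on the contract.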
Obviously, it is very difficult or impossible to associate a cost of failure
with every possible input, since the input domain is almost always much
too large. Besides, even though the test suite is generally much smaller
than the entire input domain, it can be large, and as a result associating a
cost with every element of the test suite can be overwhelming. However,
once the test suite has been run, and one can observe which inputs resulted
in failure, one is left with the job of determining the cost of failure for only
a very small number of inputs, those that have been run and failed. This is
an important advantage of this approach. One also knows how the system
failed and therefore the assignment of a cost should be much more feasible.
Weyuker outlined several methods that could be used to select inputs
for testing, grouped into two broad categories: (1) statistical testing,
where the test cases are selected (without replacement) according to an
(assumed) operational distribution, and (2) deterministic testing, where
purposively selected cases represent a given fraction of the probability of all
test cases, sorted either by probability of use or by the risk of use (which is
the product of cost(i) and the probability of use), with the highest p
percent of cases selected for testing, for some p. The rationale behind this is
that these are the inputs that are going to be used most often or of highest
risk and if these result in failure, they are going to have a large impact on
users. In this description we will focus on statistical testing, though ex-
amples for deterministic testing were also discussed.
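Deterministic selection as described above can be sketched in a few lines. This is a minimal illustration, not Weyuker's procedure: the function name, the input tuples, and the reading of "highest p percent" as a share of the number of cases (rounded up) are all assumptions made here.

```python
import math


def select_top_percent(inputs, p, by="risk"):
    """Return the highest p percent of cases, ranked by probability or by risk.

    inputs: list of (name, probability_of_use, cost_of_failure) tuples.
    by:     "probability" ranks by probability of use;
            "risk" ranks by probability of use times cost of failure.
    """
    if by == "risk":
        key = lambda row: row[1] * row[2]
    elif by == "probability":
        key = lambda row: row[1]
    else:
        raise ValueError("by must be 'risk' or 'probability'")
    ranked = sorted(inputs, key=key, reverse=True)
    count = math.ceil(len(ranked) * p / 100)  # number of cases to keep
    return ranked[:count]


# Hypothetical inputs, invented for illustration only.
inputs = [("a", 0.50, 10), ("b", 0.30, 100), ("c", 0.15, 1), ("d", 0.05, 1000)]

by_probability = [name for name, _, _ in select_top_percent(inputs, 50, by="probability")]
by_risk = [name for name, _, _ in select_top_percent(inputs, 50, by="risk")]
print(by_probability, by_risk)
```

The two rankings can disagree: a rarely used input with a very high cost of failure is selected under the risk ordering but not under the probability ordering.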
Under statistical testing, Weyuker distinguished between (a) using the
operational distribution and (b), in a situation of ignorance, using a uni-
form distribution over all possible inputs. She made the argument that
strongly skewed distributions were the norm and that assuming a uniform
distribution as a result of a lack of knowledge of the usage distribution
could strongly bias the resulting risk estimates. This can be clarified using
the following example, represented by the entries in Table 4-1.
Assume that there are 1,000 customers (or categories of customers)
ranked by volume of business. These make up the entire input domain.
Customer i1 represents 35 percent of the business, while customer i1000
provides very little business. Assume a test suite of 100 cases was selected
to be run using statistical testing based on the operational distribution, but
only 99 cases were run (without failure); i4, the test case with the largest
risk and not selected for testing, was not run. Then we behave as if i4 is the
input that had been selected as the 100th test case and that it failed. The
risk associated with this software is therefore 100 × 0.10 × (1.00/0.9999899),
or roughly 10.

TABLE 4-1 Example of Costs and Operational Distribution for Fictitious Inputs

Input         Pr         C        Pr × C
i1            0.35       5,000    1,750
i2            0.20       4,000    800
i3            0.10       1,000    100
i4            0.10       100      10
i5            0.10       50       5
i6            0.07       40       2.8
i7            0.03       50       1.5
i8            0.01       100      1.0
i9            0.005      5,000    25.0
i10           0.005      10       0.05
i11           0.004      10       0.04
i12           0.003      10       0.03
i13           0.003      1        0.003
i14-i100      0.01999    1        0.01999
i101-i999     0.00001    1        0.00001
i1000         10^-7      10^9     100

If one instead (mistakenly) assumes that the inputs follow a uniform
distribution, with 100 test cases contracted for, then if 99 test cases were
run with no defects, the risk is 1/100 times the highest c(i) for an untested
input i, or in this case 10^7. Here the risk estimate is biased high since that
input is extremely rare in reality.
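The two risk figures in this example can be checked with a short calculation. The sketch below makes several assumptions not stated in the text: that the contracted suite consists of inputs i1 through i100, that grouped table rows carry the group's total probability, and that all names are illustrative. Because it normalizes by the probability mass read from the table (0.99999) rather than the 0.9999899 quoted in the text, it agrees with the text only at the "roughly 10" level.

```python
# Rows of Table 4-1: (input label, probability of use, cost of failure).
# Grouped rows carry the group's total probability and a per-input cost.
TABLE_4_1 = [
    ("i1", 0.35, 5_000), ("i2", 0.20, 4_000), ("i3", 0.10, 1_000),
    ("i4", 0.10, 100), ("i5", 0.10, 50), ("i6", 0.07, 40),
    ("i7", 0.03, 50), ("i8", 0.01, 100), ("i9", 0.005, 5_000),
    ("i10", 0.005, 10), ("i11", 0.004, 10), ("i12", 0.003, 10),
    ("i13", 0.003, 1), ("i14-i100", 0.01999, 1),
    ("i101-i999", 0.00001, 1), ("i1000", 1e-7, 10**9),
]

# Assumption: the contracted suite is i1 through i100 (the first 14 rows).
SUITE_ROWS = TABLE_4_1[:14]


def operational_risk(unrun_label):
    """Risk when every run case passed and one contracted case was not run.

    The unrun case is charged its full cost of failure, weighted by its
    probability of use normalized over the probability mass of the suite.
    """
    suite_mass = sum(pr for _, pr, _ in SUITE_ROWS)
    for label, pr, cost in SUITE_ROWS:
        if label == unrun_label:
            return cost * pr / suite_mass
    raise KeyError(unrun_label)


def uniform_risk(suite_size, untested_labels):
    """Risk under a uniform assumption: the one unrun case has weight
    1/suite_size and is charged the highest cost among untested inputs."""
    worst = max(cost for label, _, cost in TABLE_4_1 if label in untested_labels)
    return worst / suite_size


risk_operational = operational_risk("i4")
risk_uniform = uniform_risk(100, {"i4", "i101-i999", "i1000"})
print(f"operational: {risk_operational:.4f}, uniform: {risk_uniform:.0f}")
```

The six orders of magnitude between the two results come entirely from input i1000, whose cost of 10^9 dominates the uniform estimate even though its probability of use is only 10^-7.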
Similar considerations arise with deterministic selection of test suites.
A remaining issue is how to estimate the associated field use probabilities,
especially, as is the case in many situations, where the set of possible inputs
or user types is very large. This turns out to be feasible in many situations:
AT&T, for example, regularly collects data on its operational distributions
for large projects. In these applications, the greater challenge is to model
the functioning of the software system in order to understand the risks of
failure.
FAULT-TOLERANT SOFTWARE: MEASURING SOFTWARE
AGING AND REJUVENATION
Highly reliable software is needed for applications where defects can be
catastrophic, for example, software supporting aircraft control and nuclear
systems. However, trustworthy software is also vital to support common
commercial applications, such as telecommunication and banking systems.
While total fault avoidance can at times be accomplished through use of
good software engineering practices, it can be difficult or impossible to
achieve for particularly large, complex systems. Furthermore, as discussed
above, it is impossible to fully test and verify that software is fault-free.
Therefore, instead of fault-free software, in some situations it might be
more practical to consider development of fault-tolerant software, that is,
software that can accommodate deficiencies. While hardware fault toler-
ance is a well-understood concept, fault tolerance is a relatively new, unex-
plored area for software systems. Many techniques are showing promise for
use in the development of fault-tolerant software, including design diversity
(parallel coding), data diversity (e.g., n-copy programming), and environmental
diversity (proactive fault management). (See the glossary in
Appendix B for definitions of these terms.)
Efforts to develop fault-tolerant software have necessitated attempts to
classify defects, acknowledging that different types of defects will likely
require different procedures or techniques to achieve fault tolerance. Con-
sider a situation where one has an availability model with hardware redun-
dancy and imperfect recovery software. Failures can be broadly classified
into recovery software failures, operating system failures, and application
failures. Application failures are often dealt with by passive redundancy,
using cold replication to return the application to a working state. Software
aging1 occurs when defect conditions accumulate over time, leading
to either performance degradation or software failure. It can be due to
deterioration in the availability of operating system resources, data corrup-
tion, or numerical error accumulation. The use of design diversity to ad-
dress software aging can often be prohibitively expensive. Therefore envi-
ronmental diversity, which is temporal or time-related diversity, may often
be the preferred approach.
1Note that use of the term "software aging" is not universal; the problem under
discussion is also considered a case of cumulative failure.
One particular type of environmental diversity, software rejuvenation,
which was described at the workshop by Kishor Trivedi, is restarting an
application to return to an initializing state. Rejuvenation incurs some
costs, such as downtime and lost transactions, and so an important research
issue is to identify optimal times for rejuvenation to be carried out. There
are currently two general approaches to scheduling rejuvenation: those
based on analytical modeling, and those based on measurement-based (em-
pirical, statistical) rejuvenation. In analytical modeling, transactions are
assumed to arrive according to a homogeneous Poisson process; they are
queued and the buffer is of finite size. Transactions are served by an as-
sumed nonhomogeneous Poisson process (NHPP) and the software failure
process is also assumed to be NHPP. Based on this model, two rejuvena-
tion strategies that have been proposed are a time-based approach (restart
the application every t0 time periods), and a load- and time-based approach.
A measurement-based approach to scheduling rejuvenation attempts
to directly detect "aging." In this model, the state of operating system
resources is periodically monitored and data are collected on the attributes
responsible for the performance of the system. The effect of aging on
system resources is quantified by constant measurement of these attributes,
typically through an estimation of the expected time to exhaustion. Again,
two approaches have been suggested for use as decision rules on when to
restart an application. These are time-based estimation (see, e.g., Garg et
al., 1998) and workload-based estimation (see, e.g., Vaidyanathan and
Trivedi, 1999). Time-based estimation is implemented by using nonparametric
regressions on time of attributes such as the amount of real memory
available and file table size. Workload-based estimation uses cluster analy-
sis based on data on system workload (cpuContextSwitch, sysCall, pageIn,
pageOut) in order to identify a small number of states of system perfor-
mance. Transitions from one state to another and sojourn times in each
state are modeled using a Markov chain model. The resulting model can
be used to optimize some objective function as a function of the decision
rule on when to schedule rejuvenation; one specific method that accom-
plishes this is the symbolic hierarchical automated reliability and perfor-
mance estimator.
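Time-based estimation of the expected time to exhaustion can be illustrated with a simple nonparametric trend fit. The sketch below uses a median-of-pairwise-slopes (Theil-Sen) estimator; that choice, the function names, and the sample readings are assumptions made here for illustration, since the text does not say which nonparametric regression the cited work uses.

```python
from itertools import combinations
from statistics import median


def theil_sen(times, values):
    """Robust linear trend: median of all pairwise slopes, then the
    median of the implied intercepts."""
    slopes = [
        (v2 - v1) / (t2 - t1)
        for (t1, v1), (t2, v2) in combinations(zip(times, values), 2)
        if t2 != t1
    ]
    slope = median(slopes)
    intercept = median(v - slope * t for t, v in zip(times, values))
    return slope, intercept


def time_to_exhaustion(times, free_resource):
    """Estimated time remaining after the last observation until the
    monitored resource reaches zero; None if there is no downward trend."""
    slope, intercept = theil_sen(times, free_resource)
    if slope >= 0:
        return None
    zero_crossing = -intercept / slope
    return zero_crossing - times[-1]


# Hypothetical readings of free real memory (arbitrary units) at hourly intervals.
hours = [0, 1, 2, 3, 4]
free_memory = [100, 90, 80, 70, 60]
remaining = time_to_exhaustion(hours, free_memory)
print(remaining)
```

A rejuvenation rule built on such an estimate might schedule a restart some safety margin before the projected exhaustion time; the margin itself would have to be chosen from the downtime and lost-transaction costs discussed above.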
DEFECT CLASSIFICATION AND ANALYSIS
It is reasonable to expect that the information collected on field perfor-
mance of a software system should provide useful information about both
the number and the types of defects that the software contains. There are
now efforts to utilize this information as part of a feedback loop to improve
the software engineering process for subsequent systems. A leading ap-
proach to operating this feedback loop is referred to as orthogonal defect
classification (ODC), which was described at the workshop by its devel-
oper, Ram Chillarege. ODC, created at IBM and successfully used at
Motorola, Telcordia, Nortel, and Lucent, among others, utilizes the defect
stream from software testing as a source of information on both the soft-
ware product under development and the software engineering process.
Based on this classification and analysis of defects, the overall goal is to
improve not only project management, prediction, and quality control by
various feedback mechanisms, but also software development processes.
Using ODC, each software defect is classified using several categories
that describe different aspects of the defects (see Dalal et al., 1999, for
details). One set of dimensions that has been utilized by ODC is as follows:
(1) life cycle phase when the defect was detected, (2) the defect trigger,
i.e., the type of test or activity (e.g., system test, function test, or review
inspection) that revealed the defect, (3) the defect impact (e.g., on installability,
integrity/security, performance, maintenance, standards, documentation,
usability, reliability, or capability), (4) defect severity, (5) defect type,
i.e., the type of software change that needed to be made, (6) defect modifier
(either missing or incorrect), (7) defect source, (8) defect domain, and (9)
fault origin in requirements, design, or implementation. ODC separates
the development process into various periods, and then examines the nine-
dimensional defect profile by period to look for significant changes. These
profiles are linked to potential changes in the system development process
that are likely to improve the software development process. The term
orthogonal in ODC does not connote mathematical orthogonality, but
simply that the more separate the dimensions used, the more useful they
are for this purpose.
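Comparing defect profiles across development periods can be sketched as follows. This is an illustration only: the record type, the three dimensions retained, the attribute values, and the sample defects are hypothetical, and the comparison reports raw shifts in proportions rather than a test of statistical significance.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Defect:
    """A defect record carrying a few of the classification dimensions."""
    period: str       # development period in which the defect was found
    trigger: str      # activity that revealed the defect
    defect_type: str  # type of change needed to fix it


def profile(defects, period, dimension):
    """Share of each value of `dimension` among defects found in `period`."""
    counts = Counter(getattr(d, dimension) for d in defects if d.period == period)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}


def largest_shift(defects, before, after, dimension):
    """Value whose share changed most between two periods, with the change."""
    p = profile(defects, before, dimension)
    q = profile(defects, after, dimension)
    shifts = {v: q.get(v, 0.0) - p.get(v, 0.0) for v in set(p) | set(q)}
    return max(shifts.items(), key=lambda item: abs(item[1]))


# Hypothetical defect stream, invented for illustration.
defects = [
    Defect("P1", "function test", "function"),
    Defect("P1", "function test", "function"),
    Defect("P1", "review", "assignment"),
    Defect("P1", "system test", "interface"),
    Defect("P2", "system test", "function"),
    Defect("P2", "system test", "assignment"),
    Defect("P2", "function test", "assignment"),
    Defect("P2", "review", "assignment"),
]

shift = largest_shift(defects, "P1", "P2", "defect_type")
print(shift)
```

A shift of this kind is the sort of signal that would then be linked to a candidate change in the development process.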
Vic Basili (University of Maryland), in an approach similar to that of
ODC, has also examined the extent to which one can analyze the patterns
of defects made in a company's software development in order to improve
the development process in the future. The idea is to support a feedback
loop that identifies and examines the patterns of defects, determines the
leading causes of these defects, and then identifies process changes likely to
reduce rates of future defects. Basili refers to this feedback loop as an
"experience factory."
Basili makes distinctions among the following terms. First, there are
errors, which are made in the human thought process. Second, there are
faults, which are individual, concrete manifestations of the errors within
the software; one error may cause several faults and different errors may
cause identical faults. Third, there are failures, which are departures of the
operational software system behavior from user expectations; a particular
failure may be caused by several faults, and some faults may never result in
a failure.2
The experience factory model is an effort to examine how a software
development project is organized and carried out in order to understand
the possible sources of errors, faults, and failures. Data on system performance
are analyzed and synthesized to develop an experience base, which is
then used to implement changes in the approach to project support.
Experience factory is oriented by two goals. The first is to build
baselines of defect classes; that is, the problem areas in several software
projects are identified and the number and origin of classes of defects assessed.
Possible defect origins include requirements, specification, design,
coding, unit testing, system testing, acceptance testing, and maintenance.
In addition to origin, errors can also be classified according to algorithmic
fault; for example, problems can exist with control flow, interfaces, and
data definition, initialization, or use.
Once this categorization is carried out and the error distributions by
error origin and algorithmic fault are well understood, the second goal is to
find alternative processes that minimize the more common defects. Hypotheses
concerning methods for improvement can then be evaluated
through controlled experimentation. This part of the experience factory is
relatively nonalgorithmic. The result might be the institution of cleanroom
techniques or greater emphasis on understanding of requirements, for example.
By using experience factory models in varying areas of application,
Basili has discovered that different software development environments have
very distinct patterns of defects. Further, various software engineering techniques
have different degrees of effectiveness in remedying various types of
error. Therefore, experience factory has the potential to provide important
improvements for a wide variety of software development environments.
2In the remainder of this report, the term defect is used synonymously with failure.
BAYESIAN INTEGRATION OF PROJECT DATA AND
EXPERT JUDGMENT IN PARAMETRIC SOFTWARE COST
ESTIMATION MODELS
Another use of system performance data is to construct parametric
models to estimate the cost and time to develop upgrades and new software
systems for related applications. These models are used for system scoping,
contracting, acquisition management, and system evolution. Several cost-schedule
models are now widely used, e.g., Knowledge Plan, Price S, SEER,
SLIM, COCOMO II: COSTAR, Cost Xpert, Estimate Professional, and
USC COCOMO II.2000. The data used to support such analyses include
the size of the project, which is measured in anticipated needed lines of
code or function points, effort multipliers, and scale factors.
Unfortunately, there are substantial problems in the collection of data
that support these models. These problems include disincentives to pro-
vide the data, inconsistent data definitions, weak dispersion and correlation
effects, missing data, and missing information on the context underlying
the data.
Data collection and analysis are further complicated by process and
product dynamism, in particular the receding horizon for product utility,
and software practices that do not remain static over time. (For example,
greater use of evolutionary acquisition complicates the modeling approach
used in COCOMO II.) As a system proceeds in stages from a component-
based system to a commercial-off-the-shelf system to a rapid application
development system to a system of systems, the estimation error of these
types of models typically reduces as a system moves within a stage but
typically increases in moving from one stage to another.
If these problems can be overcome, Barry Boehm (University of Southern
California [USC]), reporting on joint work with Bert Steece, Sunita
Chulani, and Jongmoon Baik, demonstrated how parametric software estimation
models can be used to estimate software costs. The steps in the
USC Center for Software Engineering modeling methodology are: (1) analyze
the existing literature, (2) perform behavioral analyses, (3) identify the
relative significance of various factors, (4) perform expert-judgment Delphi
assessment and formulate an a priori model, (5) gather project data, (6)
determine a Bayesian a posteriori model, and (7) gather more data and
refine the model. COCOMO II demonstrates that Bayesian models can be
effectively used, in a regression framework, to update expert opinion using
data from the costs to develop related systems. Using this methodology,
COCOMO II has provided predictions that are typically within 30 percent
of the actual time and cost needed.
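The idea of updating expert opinion with project data can be illustrated for a single regression coefficient. The sketch below assumes a normal prior from expert judgment and a normal data-based estimate with known variances, so that the posterior is a precision-weighted average. The function name and all numbers are hypothetical, and the actual COCOMO II calibration, a multivariate regression, is not reproduced here.

```python
def posterior(prior_mean, prior_var, data_mean, data_var):
    """Normal-normal update for one coefficient.

    Each source is weighted by its precision (1 / variance), so the
    posterior mean lies closer to whichever source is less uncertain.
    """
    prior_precision = 1.0 / prior_var
    data_precision = 1.0 / data_var
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * data_mean)
    return post_mean, post_var


# Hypothetical values: experts suggest a coefficient near 1.0 but with wide
# uncertainty; project data suggest 2.0 with narrower uncertainty.
mean, var = posterior(prior_mean=1.0, prior_var=4.0, data_mean=2.0, data_var=1.0)
print(mean, var)
```

With noisy or sparse project data the data variance is large and the expert prior dominates; as more consistent data accumulate, the posterior moves toward the data-based estimate.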
COCOMO II models the dependent variable, which is the logarithm
of effort, using a multiple linear regression model. The specific form of the
model is:
\ln(PM) = \beta_0 + \beta_1 \cdot 1.01 \cdot \ln(Size) + \beta_2 \cdot SF_1 \cdot \ln(Size) + \cdots + \beta_6 \cdot SF_5 \cdot \ln(Size)
          + \beta_7 \cdot \ln \ldots
demonstrated four applications of such data: estimation of software risk,
estimation of the parameters of a fault-tolerant software procedure, cre-
ation of a feedback loop to improve the software development process, and
estimation of the costs of the development of future systems. These appli-
cations provide only a brief illustration of the value of data collected on
software functioning. The collection of such data in a way that facilitates
its use for a wide variety of analytic purposes is obviously extremely worth-
while.