6
Safety and Reliability Assessment Methods

INTRODUCTION

Appropriate methods for assessing (as distinct from achieving or assuring) safety and reliability are the key to establishing the acceptability of digital instrumentation and control (I&C) systems in nuclear plants. Methods must be available to support estimates of reliability, assessments of safety margins, comparisons of performance with regulatory criteria such as quantitative safety goals, and overall assessments of safety in which trade-offs are made on the basis of the relative importance of disparate effects such as improved self-checking acquired at the cost of increased complexity. These methods must be sufficiently robust, justified, and understandable to be useful in assuring the public that using digital I&C technology in fact enhances public safety.

Statement of the Issue

Effective, efficient methods are needed to assess the safety and reliability of digital I&C systems in nuclear power plants. These methods are needed to help avoid potentially unsafe or unreliable applications and aid in identifying and accepting safety-enhancing and reliability-enhancing applications. What methods should be used for making these safety and reliability assessments of digital I&C systems?

Discussion

In nuclear power plants, reliability and safety are assessed using an interactive combination of deterministic and probabilistic techniques. The issues that the committee considered were the extent to which these assessment methods are applicable to digital I&C systems and the appropriate use of these methods.

Deterministic Techniques

Design basis accident analysis is a deterministic assessment of the response of the plant to a prescribed set of accident scenarios. This analysis constitutes a major section of the nuclear plant safety analysis report that is submitted to and reviewed by the U.S. Nuclear Regulatory Commission (USNRC) in the licensing process. In a design basis accident analysis, an agreed-upon set of transient events is imposed on analytical simulations of the plant. Then, assuming defined failures, the plant systems must be shown to be effective in keeping the plant within a set of defined acceptance criteria. Consider, for example, the analysis of the thermal response of the reactor following a postulated pipe rupture. In this case, the deterministic safety analysis considers:

  • the size of the rupture (the cross-sectional area of the pipe)

  • the geometry of the systems and components affected, such as volumes and elevations of pipes and vessels

  • the initial conditions (conditions at the time of the rupture), such as initial power, pressures, and temperatures

  • the response logic of the active and passive safety systems, such as the sensing of the event by the instrumentation systems, the subsequent actuation of valves that isolate the fault, and the subsequent opening of backup feedwater system valves

All these considerations are used as parameters or forcing functions in the equations that model the physical behavior of the affected systems (mainly nuclear, thermal, mass, and momentum conservation equations) to calculate the response of the system. Of particular importance is the calculation of the resultant pressures and temperatures in the cooling systems and in the core to assess the integrity of the fuel and the multiple physical barriers that contain radionuclides.
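
The flavor of such a deterministic calculation can be conveyed by a deliberately minimal lumped-parameter sketch. Every number below is an assumption chosen for illustration, not plant data, and a real analysis would solve the full conservation equations rather than a single energy balance:

    # Minimal lumped-parameter sketch of a deterministic transient calculation.
    # All values are illustrative assumptions, not plant data.

    DT = 0.1                 # time step (s)
    T_END = 300.0            # simulated duration (s)

    mass = 2.0e5             # coolant mass in the affected loop (kg), assumed
    cp = 5.0e3               # coolant specific heat (J/kg-K), assumed
    decay_power = 5.0e7      # decay heat after reactor trip (W), assumed constant
    trip_setpoint = 310.0    # temperature (deg C) at which instrumentation
                             # actuates the backup feedwater system, assumed
    backup_cooling = 8.0e7   # heat removal once backup valves open (W), assumed
    accept_limit = 350.0     # acceptance criterion on coolant temperature (deg C)

    temp = 300.0             # initial coolant temperature (deg C)
    cooling_on = False
    peak = temp
    t = 0.0
    while t < T_END:
        heat_removal = backup_cooling if cooling_on else 0.0
        # Lumped energy balance: m * cp * dT/dt = decay power - heat removal
        temp += (decay_power - heat_removal) / (mass * cp) * DT
        if temp >= trip_setpoint:
            cooling_on = True     # response logic of the safety system
        peak = max(peak, temp)
        t += DT

    verdict = "within" if peak < accept_limit else "exceeds"
    print(f"peak coolant temperature {peak:.1f} deg C ({verdict} acceptance limit)")

The structure mirrors the list above: prescribed initial conditions, a forcing function (decay heat), response logic for the safety system, and a comparison of the calculated peak against an acceptance criterion.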

Probabilistic Techniques

Probabilistic risk assessment (PRA) (or probabilistic safety assessment [PSA]) techniques are used to assess the relative effects of contributing events on system-level safety or reliability. Probabilistic methods provide a unifying means of assessing physical faults, recovery processes, contributing effects, human actions, and other events that have a high degree of uncertainty.

These analyses are typically performed using fault tree analysis, but other methods, such as event trees, reliability block diagrams, and Markov methods, are also appropriate. In PRA, the probability of occurrence of various end events, both acceptable and unacceptable, is calculated from the probabilities of occurrence of basic events (usually failure events). For example, the USNRC has established a quantitative safety goal that the probability of a core damage event shall not exceed 10⁻⁵ per reactor year. The results of a particular PRA, of course, have wide bands of uncertainty, but on a relative basis they allow the most important failure modes ("weak points") to be searched out and allow the designer to balance the design appropriately between mitigation and prevention and to avoid unhealthy dependence on single systems or components.

The development of a fault tree model serves several important purposes. First, it provides a logical framework for analyzing the failure behavior of a system and for precisely documenting which failure scenarios have been considered and which have not. Second, the fault tree model has a well-defined Boolean algebraic and probabilistic basis that relates probability calculations to Boolean logic functions. That is, a fault tree model not only shows how events can combine to cause the end (or top) event but at the same time defines how the probability of the end event is calculated as a function of the probabilities of the basic events. Thus the fault tree model can evolve as the system evolves and can be used at any time to evaluate the effect of proposed changes on the reliability and safety of the nuclear power plant. In this manner fault tree analysis can be used to support engineering tasks such as illuminating design "weak points," facilitating trade-off analyses, or assessing relative risks.

As mentioned above, the probabilistic analysis of reliability and safety depends on the assignment of a probability of occurrence to each basic event in the fault tree. In addition to the probability of an event, however, probability analysis may also address variability and uncertainty. For example, an estimate may be made of the probability that a component will fail (the probability of an event). But this failure probability may vary as a result of statistical variation in external conditions, such as temperature, or statistical characteristics of the source of the component. A second probability concept describes this variation as a probability distribution around a "point estimate" for the failure probability. Furthermore, the failure probability may not be known with perfect confidence. A third probability concept uses a distribution to express the degree of uncertainty associated with the point estimate, reflecting the differences and uncertainties among experts solicited for judgments on probabilities (see below). Thus current risk assessment practice distinguishes among probabilities of events, variability, and uncertainty (NRC, 1994).

An uncertainty analysis using the fault tree model reflects the degree to which the output value is affected by the uncertainty in an input. This analysis helps the designer determine the extent to which an unknown input can affect the reliability or safety of the system and thus the extent to which the system must be able to withstand such uncertainty (Modarres, 1993).

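To make the Boolean-to-probability correspondence concrete, the sketch below evaluates a hypothetical three-event fault tree and propagates lognormal uncertainty on the basic events by Monte Carlo sampling. The tree structure, probabilities, and error factors are invented for illustration and are not drawn from any plant PRA:

    # Hypothetical three-event fault tree: TOP = A OR (B AND C).
    # Probabilities and error factors are invented for illustration.
    import math
    import random
    import statistics

    def top_probability(p_a, p_b, p_c):
        # The Boolean structure fixes the algebra: for independent events,
        # P(B AND C) = p_b * p_c and P(X OR Y) = 1 - (1 - p_x) * (1 - p_y).
        return 1.0 - (1.0 - p_a) * (1.0 - p_b * p_c)

    # Point estimate from the basic-event point estimates (assumed values).
    print(f"top-event point estimate: {top_probability(1e-4, 1e-2, 3e-2):.2e}")

    # Uncertainty propagation: treat each basic-event probability as
    # lognormal about its median, a common choice in PRA practice.
    def sample(median, error_factor):
        sigma = math.log(error_factor) / 1.645   # error factor at 95th percentile
        return median * math.exp(random.gauss(0.0, sigma))

    random.seed(1)
    draws = sorted(top_probability(sample(1e-4, 10), sample(1e-2, 3), sample(3e-2, 3))
                   for _ in range(20_000))
    print(f"mean {statistics.mean(draws):.2e}, "
          f"95th percentile {draws[int(0.95 * len(draws))]:.2e}")

The same pattern scales to real fault trees: the Boolean structure fixes the algebra, and the sampled basic-event distributions yield the output distribution used in uncertainty analysis.
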
But the fundamental concept in probabilistic analysis remains the probability of an event, and there are several interpretations of that probability (Cooke, 1991; Cox, 1946; McCormick, 1981; Modarres, 1993). The classic notion of the probability of an event is the ratio of the size of the subspace of sample points that include the event to the size of the sample space. A frequency interpretation is the one most commonly understood; it defines the probability of an event as the limit of the ratio of the number of such events observed to the number of trials as the number of trials becomes large.

Many events considered in a probabilistic safety assessment in the nuclear field are, however, classifiable as rare events, which complicates the estimation of occurrence probabilities for the basic events. If failure probabilities are to be estimated from life testing or field experience, many samples must be studied over long periods of time in order to gain any statistical significance in the data (Leemis, 1995). Several databases and handbooks exist to help with the estimation of failure probabilities for basic events (Bellcore, 1992; DOD, 1991; Gertman and Blackman, 1994; RAC, 1995). Within the nuclear engineering community, failure data for nuclear-specific systems and components are available from several sources, including summaries of licensee event reports (USNRC, 1980, 1982a, 1982b) and other handbooks (IEEE, 1983; USNRC, 1975). The existence and use of such handbooks help address the problems associated with obtaining failure data for many of the basic events.

But for some basic events, where there are few or no applicable data on frequencies, subjective interpretations of probability may be used and may, in fact, be all that is available. Subjective probabilities may be sought in formal and informal processes in which groups of experts weigh available evidence and make judgments. This approach to probability is of course not based on relative frequencies and does not require samples or trials except as they may be available to inform subjective engineering judgment. Rather, subjective interpretations are commonly described as measures of the degree of belief that an event will occur. For example, Apostolakis (1990) states that "probability is a measure of belief." He continues: "The primitive notion is that of 'more likely': that is, we can intuitively say that event A is more likely than event B. Probability is simply a numerical expression for this likelihood." However, as more information becomes available, the subjective distribution (see the discussion of uncertainty analysis above) can be adjusted to reflect the current state of knowledge.

There is extensive experience in nuclear risk studies and elsewhere with such elicitation of expert judgments on probabilities. Bayesian analysis (Leemis, 1995) tells how past observations (i.e., frequency data) influence the subjective judgment.

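As a minimal illustration of such a Bayesian update, a conjugate beta-binomial model shows how failure-free demands shift a subjective failure-probability estimate. This is a sketch only: the prior parameters stand in for elicited expert judgment, and the demand counts are invented:

    # Conjugate beta-binomial update of a per-demand failure probability.
    # The beta(0.5, 999.5) prior stands in for elicited expert judgment;
    # the demand counts are invented.

    alpha, beta = 0.5, 999.5        # prior mean = 0.5 / 1000 = 5.0e-4
    demands, failures = 5000, 0     # observed field experience (assumed)

    alpha_post = alpha + failures
    beta_post = beta + demands - failures

    print(f"prior mean failure probability     : {alpha / (alpha + beta):.1e}")
    print(f"posterior mean after {demands} demands: "
          f"{alpha_post / (alpha_post + beta_post):.1e}")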

Certain characteristic biases, such as tendencies toward overconfidence, are known to occur (Cooke, 1991). Notwithstanding its limitations, the subjective interpretation of probability is the usual basis for the analysis of rarely occurring events and underlies many risk evaluations (McCormick, 1981). As such, it is important to the committee's consideration of the applicability of probabilistic analysis to digital systems.

Hazard analysis (i.e., experts thinking about what might go wrong) has proved effective for at least 50 years. Random testing has been suggested as an alternative approach. However, truly random testing is not particularly good at finding hazards; it amounts to a needle-in-a-haystack search. Tests might also be randomly generated from an abstract description of a rare-event scenario, but significant expertise is needed to formulate such a description.

Applicability to Digital Systems

Deterministic analysis techniques for digital systems are a generalization of the design basis accident methodology used in the nuclear industry and include such techniques as hazard analysis and formal methods (Leveson, 1995; Rushby, 1995). The use of deterministic analysis techniques for digital systems is not controversial, as long as they are applied with care to consider the failure modes attributable to digital systems.

More controversial is the applicability of probabilistic models to digital systems, and the committee devoted much of its effort to assessing the applicability of probabilistic analysis methods to such systems. Although well-accepted techniques exist for the analysis of physical faults, probabilistic analysis of design faults in critical systems is more problematic. Because software faults are by definition design faults, the discussion here focuses on probabilistic techniques for assessing software. It should be noted that much of the discussion is applicable to similar systems implemented in hardware, using programmable devices or application-specific integrated circuits.

There is controversy within the software engineering community as to whether software fails, whether it fails randomly, and whether the notion of a software failure rate exists. Some would assert that software does not "fail" because it does not physically change when an incorrect result is produced. Others assert that software either works or does not work, and thus its reliability is either zero or one (see, e.g., Singpurwalla, 1995, and the published discussion accompanying that reference). Some who accept the notion of software failure disagree as to whether software failure can be modeled probabilistically. Some argue that software is deterministic, in that given a particular set of inputs and internal state, the behavior of the software is fixed. The most common justification for the apparent random nature of software failures is the randomness or uncertainty of the input sequences (Eckhardt and Lee, 1985; Laprie, 1984; Littlewood and Miller, 1989). For example, Finelli (1991) identifies "error crystals" (regions of the input space that cause a program to produce errors); a software failure occurs when the input trajectory enters an error crystal. Recent experimental work (Goel, 1996) suggests that the reliability of some software can be modeled stochastically as a function of the workload.

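A toy example (all details invented) illustrates the "error crystal" picture and why purely random testing is a needle-in-a-haystack search when the defect region is small:

    # Toy "error crystal": a routine that is wrong only on a narrow band of
    # its input space, probed by uniform random testing. Details invented.
    import random

    def toy_controller(x):
        # Intended behavior: clamp the demand signal to [0, 100].
        # Hidden defect: a narrow input band returns a wrong value.
        if 42.000 <= x <= 42.001:    # the "error crystal" (1e-5 of the range)
            return -1.0
        return min(max(x, 0.0), 100.0)

    random.seed(0)
    trials = 1_000_000
    hits = sum(toy_controller(random.uniform(0.0, 100.0)) < 0.0
               for _ in range(trials))
    print(f"{hits} defect hits in {trials:,} random tests "
          f"(expected about {trials * 1e-5:.0f})")

The hit rate is proportional to the crystal's share of the input space, so a defect occupying one part in 10⁵ of the range requires on the order of 10⁵ random tests to be seen even once.
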
For nonsafety-critical software systems, statistical analysis techniques are used in the software reliability engineering process (Lyu, 1996). For example, the statistical analysis of the results (i.e., detected failures) of a good set of tests can, based on the operational profile, help managers answer questions such as "When can I release this version?" or "When can I consider this phase of testing complete?" The basic premise is that a set of random tests of a large software system provides data on the probability of failure for a particular version of software.

Many of the methods developed for software reliability engineering of large-scale commercial systems are not directly applicable to embedded systems for critical applications. One problem with the software reliability engineering approaches is that a very large number of test cases must be considered to statistically validate a low probability of failure (Butler and Finelli, 1993). For very reliable software, the software would be expected to pass every test, making statistical analysis even more difficult. If software for a safety-critical application were to fail a test, the software would be changed in such a way as to correct the error and the testing would be restarted. Thus, a point would be reached when the software would have passed a very large number of tests.

Miller et al. (1992) describe several methods for estimating a probability of failure for software that, in its current version, has not failed during random testing. Bertolino and Strigini (1996) propose a method for estimating both the probability of failure and the probability of program correctness from a series of failure-free test executions. Parnas et al. (1990) describe a methodology for determining how many tests should be passed in order to achieve a certain level of confidence that the failure probability is below a specified upper bound. A similar approach is described in NUREG/CR-6113 (USNRC, 1993a). In this case, the operating range of a safety system is considered to be the transition region between safe and unsafe operation. It is therefore recommended that random tests be selected in this transition region, and a mathematical formula is given for determining the number of test cases needed for statistical confidence that the failure probability is below a given upper bound.

The validity of these methods depends on the quality of the test cases chosen. The test cases should be representative of the inputs encountered in practice and should certainly include all boundary conditions and known potentially hazardous cases.

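The arithmetic behind such test-counting arguments can be sketched generically. If the true per-demand failure probability were p0, the chance of observing n failure-free random tests is (1 − p0)ⁿ; requiring that chance to be at most 1 − C gives n ≥ ln(1 − C)/ln(1 − p0). The sketch below uses this generic bound, not the specific formulation of any of the cited reports:

    # Failure-free random tests needed to claim, with confidence C, that the
    # per-demand failure probability is below p0 (generic bound, assumed form).
    import math

    def tests_needed(p0, confidence):
        # Require (1 - p0)**n <= 1 - confidence.
        return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p0))

    for p0 in (1e-2, 1e-3, 1e-4, 1e-5):
        print(f"p0 = {p0:.0e}: {tests_needed(p0, 0.99):>9,} "
              f"failure-free tests for 99% confidence")

The rapid growth of n as p0 decreases is the practical core of the infeasibility argument of Butler and Finelli (1993).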

Random testing should, however, be only a part of a complete program for safety assessment and quality assurance, a program that includes formal methods (Rushby, 1995) or other analysis techniques throughout the development and assurance process. Testing and formal methods, besides being complementary, can be mutually supportive as well. Analysis can help determine potentially hazardous conditions that should be tested, and testing can help validate critical assumptions made in the analysis (Walter, 1990).

Some failure data from operational systems in the nuclear and other industries are available (Paula, 1993). Failure rates for microprocessor-based programmable logic controllers used in emergency shutdown systems are reported by Mitchell and Williams (1993). Failures of fault-tolerant digital control systems are analyzed by Paula et al. (1993), who also present a quantitative fault tree analysis that helped a group of owners decide whether to replace existing analog control systems with fault-tolerant digital control systems. In 90 system-years of operation, 279 single-channel failures and 55 multiple-channel failures were reported. Of the 55 multiple-channel failures, nine were attributed to software deficiencies. The fault tree analysis included such failure modes as inadvertent operator actions, software failures, physical damage from external events, lack of coverage, and hardware component and communication failures.

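As a simple worked example, the field experience just cited converts to point-estimate rates as follows (generic arithmetic, not the analysis of Paula et al., 1993):

    # Point-estimate rates from the field experience cited above
    # (generic arithmetic, not the analysis of Paula et al., 1993).

    SYSTEM_YEARS = 90.0
    HOURS_PER_YEAR = 8760.0

    for label, count in (("single-channel failures", 279),
                         ("multiple-channel failures", 55),
                         ("software-related multiple-channel", 9)):
        rate = count / SYSTEM_YEARS
        print(f"{label:34s}: {rate:5.2f} per system-year "
              f"({rate / HOURS_PER_YEAR:.1e} per system-hour)")
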
CURRENT U.S. NUCLEAR REGULATORY COMMISSION REGULATORY POSITION AND PLANS

The criteria under which a utility can make plant changes without prior USNRC approval are established in 10 CFR 50.59. One of the specified criteria for determining whether a change requires approval (i.e., involves an unreviewed safety question) is whether the probability of occurrence or the consequences of an accident or malfunction of equipment important to safety previously evaluated in the safety analysis report may be increased.

The USNRC is increasingly incorporating probabilistic risk assessment into all of its rulemaking activities as it develops a risk-informed, performance-based stance (Newman, 1995). The current USNRC regulatory position on the probabilistic analysis of digital systems, however, is not clearly established or well documented. In an October 1995 presentation to the committee, USNRC staff described their position as follows (USNRC, 1995a): "It is the responsibility of the licensees to ensure appropriate reliability and safety of the digital I&C system. The design life-cycle activities permit both qualitative and quantitative methodologies for assessing reliability and are sufficiently adaptable to consider the evolving aspect of digital technology." However, although qualitative software assurance techniques are presented in several NUREG publications prepared by the Lawrence Livermore National Laboratory (USNRC, 1993b, 1995b), these contain no discussion of probabilistic analysis. In fact, in the October 1995 presentation, the USNRC staff also stated that "quantitative reliability assessment methods for digital systems are not believed to be sufficiently developed to be acceptable as standard practice" (USNRC, 1995a). In further discussions with the committee in April 1996, addressing the evaluation of relative frequencies of occurrence for use in 10 CFR 50.59 determinations, the USNRC staff indicated that they did not consider current evaluation methods to be sufficiently accurate to be meaningful (USNRC, 1996b).

DEVELOPMENTS IN THE U.S. NUCLEAR INDUSTRY

In the U.S. nuclear industry, the use of probabilistic analysis for digital systems (particularly software) is mixed. The analysis of a fault-tolerant digital control system by Paula et al. (1993) used a fault tree and included software failures; however, this approach is not common. A discussion of key assumptions and guidelines for PRA from the Electric Power Research Institute's Utility Requirements Document (EPRI, 1992) makes no mention of software or of failure modes peculiar to digital systems.

When several industry representatives were asked by the committee about the use of probabilistic analysis, the responses were mixed or inconclusive. Asked about the probabilistic risk assessment for the General Electric (GE) Advanced Boiling Water Reactor design, the GE representative told the committee that the GE analysis assumed that the software quality assurance and verification and validation (V&V) methodologies addressed the software failure issue (Simon, 1996); thus software failures were not explicitly included in the PRA. By contrast, the PRA for the protection and safety monitoring system of the Westinghouse AP600 used a software common-mode unavailability of 1.1 × 10⁻⁵ failures per demand for any particular software module, and a software common-mode unavailability of 1.2 × 10⁻⁶ failures per demand for software failures that would manifest themselves across all types of software modules derived from the same basic design program in all applications (Westinghouse/ENEL, 1992).

DEVELOPMENTS IN THE FOREIGN NUCLEAR INDUSTRY

As discussed in earlier chapters, the Canadian Atomic Energy Control Board (AECB) is currently formalizing an approach to software assessment in a new regulatory guide (AECB, 1996). The AECB assessment of software focuses on four aspects: review of software requirements specifications, systematic inspection of software development and implementation, review of software testing, and confirmation of software development process and management. The AECB approach requires an analysis of software criticality to assess the role of software in plant safety.

A probabilistic analysis is not required since it "is difficult to produce a statistically valid set of accident conditions that a protection system must guard against. However, we maintain that usage testing can build confidence in the reliability of the software (as long as no failures occur)" (Taylor and Faya, 1995).

In the United Kingdom, Nuclear Electric is carrying out extensive dynamic testing of at least substantial portions of Sizewell B's software as part of its safety case for the reactor's primary protection system. A quantification of the reliability was reportedly not required for licensing, but Nuclear Electric has decided to continue the testing to more accurately estimate the reliability of its software as part of its research and development activity (Marshall, 1995).

DEVELOPMENTS IN OTHER SAFETY-CRITICAL INDUSTRIES

In other safety-critical industries, the use of deterministic safety analysis methods is prevalent; the use of probabilistic analysis is mixed. The Federal Aviation Administration relies heavily on the DO-178B standard for software quality assurance (Software Considerations in Airborne Systems and Equipment Certification) and eschews probabilistic assessment of software failure.

A representative of a developer of railway control systems reported to the committee on his industry's use of formal methods for safety assessment (requirements analysis, hazard analysis, failure modes and effects analysis), abstract modeling (Petri nets, VHDL simulations, Markov models), and detailed experimental fault injection (Profetta, 1996). Within the rail industry there is a trend toward PRA-based analysis, raising for that industry many of the same issues facing the nuclear industry.

The manager of software engineering at a developer of implantable devices for cardiac rhythm management described his company's system development process, which included safety and reliability assessment and V&V at each stage (Elliott, 1996). Specification analysis included data flow diagrams, state charts, and other formal methods. Quantitative analysis included extensive use of field data and an assessment of the importance of software failure to overall system safety.

ANALYSIS

Techniques for deterministic analysis of safety and reliability are well accepted and are applicable to digital systems. Formal methods are not yet widely used but offer a good basis for safety analysis of digital systems (Leveson, 1995; Rushby, 1995).

When considering a probabilistic analysis of a system containing digital components, the analyst has essentially three choices. First, one can estimate a probability of failure for the digital system, including software, using the best available data and the results of statistically meaningful random tests. An uncertainty analysis can help minimize the dependence on an uncertain input for the achievement of a reliability or safety goal.

The second choice is to assume either that the software never fails or that it always fails. The former assumption coincides with not including the software in the fault tree. Alternatively, one could assume that the software will certainly fail, assign a failure probability of one, and design the system to survive such a failure. Many analysts who are hesitant to model software probabilistically leave the software out of the fault tree. Since this omission is equivalent to assuming that the software does not fail, the result may be unduly optimistic. However, if the analyst can subjectively determine a reasonable upper bound on the probability of failure (e.g., by the use of quality assurance techniques and statistically meaningful random testing), the resulting analysis may be more meaningful.

The third choice is to abandon entirely the use of probabilistic analysis for the reliability and safety of a nuclear power plant. This choice seems impractical, as PRA is a key component of nuclear power plant safety analysis and has been used effectively. However, if traditional fault tree analysis is used in PRA, it must be recognized that it is limited in its ability to model some of the failure modes associated with digital systems, especially those that incorporate fault tolerance.

Other methods are also available. For example, Markov methods are generally accepted as appropriate for analyzing fault-tolerant digital systems (Johnson, 1989), and some mention of Markov models has appeared in the nuclear literature (Bechta Dugan et al., 1993; Sudduth, 1993), but their use appears limited within the nuclear community. Although Markov models are more flexible than fault tree models and are useful for modeling various sequence dependencies, common-cause failures, and failure event correlations, they have the disadvantage of being hard to specify and of requiring very long solution times for large models. Recent work (Bechta Dugan et al., 1992) has expanded the applicability of fault tree models to handle the complexities associated with the analysis of fault-tolerant systems without necessitating the specification of a complex Markov model. This dynamic fault tree model integrates well with a traditional fault tree analysis of other parts of the system (Pullum and Bechta Dugan, 1996). In addition to these extensions of the fault tree model, other analysis techniques have been proposed, for example, dynamic flowgraphs (Garrett et al., 1995; USNRC, 1996a).

Further, fault-tolerant digital systems are known to be susceptible to "coverage failures," a type of common-cause failure that can bring down the entire system on a single fault. Coverage failures have been shown to dramatically affect the reliability analysis of highly reliable systems (Arnold, 1973; Bechta Dugan and Trivedi, 1989; Bouricius et al., 1969), so it is important to include them in a model. Paula (1993) provides data on coverage failures in PLC systems used in the chemical process and nuclear power industries.

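A minimal Markov sketch (with invented rates and parameters) of a two-channel system illustrates why imperfect coverage matters: over a short mission, the single uncovered fault can dominate the exhaustion of redundancy:

    # Minimal Markov model of a duplex (two-channel) system with imperfect
    # fault coverage, integrated by Euler steps. All parameters are invented.
    # States: both channels up; one channel up (fault covered); system failed.

    LAM = 1.0e-4      # per-channel failure rate (per hour), assumed
    COV = 0.99        # coverage: probability a fault is detected and isolated
    DT = 0.01         # integration step (hours)
    MISSION = 10.0    # mission time (hours), assumed

    p2, p1 = 1.0, 0.0         # state probabilities (start with both channels up)
    fail_uncovered = 0.0      # failure via a single undetected (uncovered) fault
    fail_exhausted = 0.0      # failure after both channels are lost
    t = 0.0
    while t < MISSION:
        # From "both up," channel faults occur at rate 2*LAM and are covered
        # with probability COV; an uncovered fault fails the system outright.
        fail_uncovered += 2.0 * LAM * (1.0 - COV) * p2 * DT
        fail_exhausted += LAM * p1 * DT
        dp2 = -2.0 * LAM * p2 * DT
        dp1 = (2.0 * LAM * COV * p2 - LAM * p1) * DT
        p2 += dp2
        p1 += dp1
        t += DT

    print(f"uncovered-fault contribution : {fail_uncovered:.1e}")
    print(f"redundancy-exhaustion term   : {fail_exhausted:.1e}")

Even with 99 percent coverage, the uncovered-fault term in this sketch exceeds the exhaustion term by roughly an order of magnitude, which is why coverage failures must appear in the model of a highly redundant system.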

CONCLUSIONS AND RECOMMENDATIONS

Conclusions

Conclusion 1. Deterministic assessment methodologies, including design basis accident analysis, hazard analysis, and other formal analysis procedures, are applicable to digital systems.

Conclusion 2. There is controversy within the software engineering community as to whether an accurate failure probability can be assessed for software or even whether software fails randomly. However, the committee agreed that a software failure probability can be used for the purposes of performing probabilistic risk assessment (PRA) in order to determine the relative influence of digital system failure on the overall system. Explicitly including software failures in a PRA for a nuclear power plant is preferable to the alternative of ignoring software failures.

Conclusion 3. The assignment of probabilities of failure for software (and more generally for digital systems) is not substantially different from the handling of many of the probabilities for rare events. A good software quality assurance methodology is a prerequisite to providing a basis for the generation of bounded estimates for software failure probability. Within the PRA, uncertainty and sensitivity analysis can help the analyst ensure that the results are not unduly dependent on parameters that are uncertain. As in other PRA computations, bounded estimates for software failure probabilities can be obtained by processes that include valid random testing and expert judgment.¹

Conclusion 4. Probabilistic analysis is theoretically applicable in the same manner to commercial off-the-shelf (COTS) equipment, but the practical application may be difficult. The difficulty arises when attempting to use field experience to assess a failure probability, in that the experience may or may not be equivalent. For programmable devices, the software failure probability may be unique to each application. However, a set of rigorous tests may still be applicable to bounding the failure probability, as with custom systems. A long history of successful field experience may be useful in eliciting expert judgment.

Recommendations

Recommendation 1. The USNRC should require that the relative influence of software failure on system reliability be included in PRAs for systems that include digital components.

Recommendation 2. The USNRC should strive to develop methods for estimating the failure probabilities of digital systems, including COTS, for use in probabilistic risk assessment. These methods should include acceptance criteria, guidelines and limitations for use, and any needed rationale and justification.²

Recommendation 3. The USNRC and industry should evaluate their capabilities and develop a sufficient level of expertise to understand the requirements for gaining confidence in digital implementations of system functions and the limitations of quantitative assessment.

Recommendation 4. The USNRC should consider support of programs aimed at developing advanced techniques for analysis of digital systems that might be used to increase confidence and reduce uncertainty in quantitative assessments.

¹ Committee member Nancy Leveson did not concur with this conclusion.
² See also Chapter 8, Dedication of Commercial Off-the-Shelf Hardware and Software.

REFERENCES

AECB (Atomic Energy Control Board, Canada). 1996. Draft Regulatory Guide C-138, Software in Protection and Control Systems. Ottawa, Ontario: AECB.
Apostolakis, G. 1990. The concept of probability in safety assessments of technological systems. Science 250(Dec. 7):1359–1364.
Arnold, T.F. 1973. The concept of coverage and its effect on the reliability model of a repairable system. IEEE Transactions on Computers 22(3):251–254.
Bechta Dugan, J., and K.S. Trivedi. 1989. Coverage modeling for dependability analysis of fault-tolerant systems. IEEE Transactions on Computers 38(6):775–787.
Bechta Dugan, J., S.J. Bavuso, and M.A. Boyd. 1992. Dynamic fault tree models for fault tolerant computer systems. IEEE Transactions on Reliability 41(3):363–377.
Bechta Dugan, J., S.J. Bavuso, and M.A. Boyd. 1993. Fault trees and Markov models for reliability analysis of fault tolerant systems. Reliability Engineering and System Safety 39:291–307.
Bellcore. 1992. Reliability Prediction for Electronic Equipment. Report TR-NWT-000332, Issue 4. September.
Bertolino, A., and L. Strigini. 1996. On the use of testability measures for dependability assessment. IEEE Transactions on Software Engineering 22(2):97–108.
Bouricius, W.G., W.C. Carter, and P.R. Schneider. 1969. Reliability modeling techniques for self-repairing computer systems. Pp. 295–309 in Proceedings of the 24th Annual Association for Computing Machinery (ACM) National Conference, August 26–28, 1969. New York: ACM.
Butler, R.W., and G.B. Finelli. 1993. The infeasibility of quantifying the reliability of life-critical real-time software. IEEE Transactions on Software Engineering 19(1):3–12.
Cooke, R. 1991. Experts in Uncertainty: Opinion and Subjective Probability in Science. Oxford: Oxford University Press.
Cox, R.T. 1946. Probability, frequency and reasonable expectation. American Journal of Physics 14(1):1–13.
DOD (U.S. Department of Defense). 1991. Reliability Prediction of Electronic Equipment. MIL-HDBK-217F. Griffiss Air Force Base, N.Y. December.
Eckhardt, D.E., and L.D. Lee. 1985. A theoretical basis for the analysis of multiversion software subject to coincident errors. IEEE Transactions on Software Engineering 11(12):1511–1517.
Elliott, L. 1996. Presentation to the Committee on Application of Digital Instrumentation and Control Systems to Nuclear Power Plant Operations and Safety, Washington, D.C., April 16.
EPRI (Electric Power Research Institute). 1992. Advanced Light Water Reactor Utility Requirements Document, Appendix A. EPRI NP-6780-L. Palo Alto, Calif.: EPRI.
Finelli, G.B. 1991. NASA software failure characterization experiments. Reliability Engineering and System Safety 32:155–169.
Garrett, C.J., S.B. Guarro, and G. Apostolakis. 1995. The dynamic flowgraph methodology for assessing the dependability of embedded software systems. IEEE Transactions on Systems, Man and Cybernetics 25(5):824–840.
Gertman, D.I., and H.S. Blackman. 1994. Human Reliability and Safety Analysis Data Handbook. New York: John Wiley & Sons.
Goel, A. 1996. Relating operational software reliability and workload: Results from an experimental study. Pp. 167–172 in Proceedings of the 1996 Annual Reliability and Maintainability Symposium, Las Vegas, Nev., January 22–25, 1996. Piscataway, N.J.: IEEE.
IEEE (Institute of Electrical and Electronics Engineers). 1983. IEEE Guide to Collection and Presentation of Electrical, Electronic and Sensing Component and Mechanical Equipment Reliability Data for Nuclear Power Generating Stations. Std 500-1984. New York: IEEE.
Johnson, B.W. 1989. Design and Analysis of Fault Tolerant Digital Systems. New York: Addison-Wesley.
Laprie, J.-C. 1984. Dependability evaluation of software systems in operation. IEEE Transactions on Software Engineering 10(6):701–714.
Leemis, L.M. 1995. Reliability: Probabilistic Models and Statistical Methods. Upper Saddle River, N.J.: Prentice-Hall.
Leveson, N. 1995. Safeware: System Safety and Computers. New York: Addison-Wesley.
Littlewood, B., and D.R. Miller. 1989. Conceptual modeling of coincident failures in multiversion software. IEEE Transactions on Software Engineering 15(12):1596–1614.
Lyu, M., ed. 1996. Handbook of Software Reliability Engineering. New York: McGraw-Hill.
Marshall, P. 1995. NE tries for quantification of software-based system. Inside NRC 17(20):9.
McCormick, N.J. 1981. Reliability and Risk Analysis. San Diego: Academic Press.
Miller, K.W., L.J. Morell, R.E. Noonan, S.K. Park, D.M. Nicol, B.W. Murrill, and J.M. Voas. 1992. Estimating the probability of failure when testing reveals no failures. IEEE Transactions on Software Engineering 18(1):33–43.
Mitchell, C.M., and K. Williams. 1993. Failure experience of programmable logic controllers used in emergency shutdown systems. Reliability Engineering and System Safety 39:329–331.
Modarres, M. 1993. What Every Engineer Should Know About Reliability and Risk Analysis. New York: Marcel Dekker.
Newman, P. 1995. NRC takes a chance, turns to risk-based regulation. The Energy Daily 23(168):3.
NRC (National Research Council). 1994. Science and Judgment in Risk Assessment. Board on Environmental Studies and Toxicology. Washington, D.C.: National Academy Press.
Parnas, D., A.J. van Schouwen, and S.P. Kwan. 1990. Evaluation of safety-critical software. Communications of the ACM 33(6):636–648.
Paula, H.M. 1993. Failure rates for programmable logic controllers. Reliability Engineering and System Safety 39:325–328.
Paula, H.M., M.W. Roberts, and R.E. Battle. 1993. Operational failure experience of fault-tolerant digital control systems. Reliability Engineering and System Safety 39:273–289.
Profetta, J. 1996. Presentation to the Committee on Application of Digital Instrumentation and Control Systems to Nuclear Power Plant Operations and Safety, Washington, D.C., April 16.
Pullum, L.L., and J. Bechta Dugan. 1996. Fault tree models for the analysis of complex computer systems. Pp. 200–207 in Proceedings of the 1996 Annual Reliability and Maintainability Symposium, Las Vegas, Nev., January 22–25, 1996. Piscataway, N.J.: IEEE.
RAC (Reliability Analysis Center). 1995. Nonelectronic Parts Reliability Data 1995. Rome, N.Y.: Reliability Analysis Center.
Rushby, J. 1995. Formal Methods and Their Role in the Certification of Critical Systems. Technical Report CSL-95-1. Menlo Park, Calif.: SRI International. March.
Simon, B. 1996. Presentation to the Committee on Application of Digital Instrumentation and Control Systems to Nuclear Power Plant Operations and Safety, Irvine, Calif., February 28.
Singpurwalla, N.D. 1995. The failure rate of software: Does it exist? IEEE Transactions on Reliability 44(3):463–466.
Sudduth, A.L. 1993. Hardware aspects of safety-critical digital computer-based instrumentation and control systems. Pp. 81–104 in Proceedings of the Digital Systems Reliability and Nuclear Safety Workshop, NUREG/CP-0136, U.S. Nuclear Regulatory Commission and the National Institute of Standards and Technology, Rockville, Md., September 13–14, 1993. Washington, D.C.: U.S. Government Printing Office.
Taylor, R.P., and A.J.G. Faya. 1995. Regulatory guide for software assessment. Presented at the 2nd COG CANDU Computer Conference, Toronto, Ontario. October.
USNRC (U.S. Nuclear Regulatory Commission). 1975. Reactor Safety Study: An Assessment of Accident Risks in U.S. Commercial Nuclear Power Plants. NUREG-75/014 (WASH-1400). Washington, D.C.: USNRC. October.
USNRC. 1980. Data Summaries of Licensee Event Reports of Diesel Generators at U.S. Commercial Nuclear Power Plants. NUREG/CR-1362. Washington, D.C.: USNRC. March.
USNRC. 1982a. Data Summaries of Licensee Event Reports of Pumps at U.S. Commercial Nuclear Power Plants. NUREG/CR-1205. Washington, D.C.: USNRC. January.
USNRC. 1982b. Data Summaries of Licensee Event Reports of Valves at U.S. Commercial Nuclear Power Plants. NUREG/CR-1363. Washington, D.C.: USNRC. October.
USNRC. 1993a. Class 1E Digital Systems Studies. NUREG/CR-6113. Washington, D.C.: USNRC.
USNRC. 1993b. Software Reliability and Safety in Nuclear Protection Systems. NUREG/CR-6101. Washington, D.C.: USNRC.
USNRC. 1995a. Presentation by USNRC staff (J. Wermeil) to the Committee on Application of Digital Instrumentation and Control Systems to Nuclear Power Plant Operations and Safety, Washington, D.C. October.
USNRC. 1995b. Verification and Validation Guidelines for High Integrity Systems. NUREG/CR-6293. Washington, D.C.: USNRC.
USNRC. 1996a. Development of Tools for Safety Analysis of Control Software in Advanced Reactors. NUREG/CR-6465. S. Guarro, M. Yau, and M. Motamed. Washington, D.C.: USNRC. April.
USNRC. 1996b. Presentation by USNRC staff (J. Wermeil) to the Committee on Application of Digital Instrumentation and Control Systems to Nuclear Power Plant Operations and Safety, Washington, D.C. April.
Walter, C.J. 1990. Evaluation and design of an ultra-reliable distributed architecture for fault tolerance. IEEE Transactions on Reliability 39(5):492–499.
Westinghouse/ENEL. 1992. Simplified Passive Advanced Light Water Reactor Plant Program: AP600 Probabilistic Risk Assessment. DE-AC03-90SF18495. Prepared for the U.S. Department of Energy by Westinghouse/ENEL. Washington, D.C.: DOE.