The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.



1 Executive Summary

The Shuttle Criticality Review and Hazard Analysis Audit Committee (SCRHAAC) was formed by the National Research Council (NRC), at the request of the National Aeronautics and Space Administration (NASA), in response to a recommendation of the Presidential Commission on the Space Shuttle Challenger Accident (also known as the Rogers Commission). That Commission had recommended that NASA review and evaluate certain aspects of its process for ensuring the safety of the National Space Transportation System (NSTS), and that an NRC panel be appointed to audit the NASA review effort and verify its adequacy.

The Committee monitored the overall NASA review and evaluation effort while performing detailed on-site reviews of its implementation for selected elements¹ and subsystems (e.g., the Space Shuttle Main Engine, Solid Rocket Booster, Auxiliary Power Unit). As areas of particular concern emerged, such as software issues, the adequacy of Orbiter structural margins, integrated Space Transportation System (STS) analysis in support of risk assessment, and Orbiter steering on landing, the Committee pursued those concerns in greater detail. Various operational issues affecting Shuttle safety (e.g., the application of Launch Commit Criteria and the "cannibalization" of spare parts) were also examined. Each of these audits was conducted through a series of meetings with NASA and contractor personnel on-site at the contractor facilities and NASA centers, and by reviewing available documentation. In addition, two NASA liaison persons provided direct input on questions raised by the Committee on an ongoing basis and provided substantial reports on certain points of concern.

¹ There are four major flight "elements" in the Space Shuttle (Orbiter, Space Shuttle Main Engines, Solid Rocket Boosters, and External Tank), each of which is composed of several subsystems.

The Committee appreciates that NASA has accomplished the design, development, verification, and certification of the STS utilizing a management approach and procedures that have been, in large part, most successful. The Committee also recognizes that the risk assessment and management recommendations made in this report will only be useful if they are introduced in rational, practical stages. The Committee believes, however, that the safety of continuing operations of the STS can be improved by creating an integrated risk assessment and management program which builds on the largely qualitative methods used previously. The totality of the recommendations, once such a system is implemented, should be extremely valuable in the accomplishment of the NSTS Program in the future, and should serve as a prototype for similar programs in NASA as well.

During the course of its work, the Committee produced two interim progress reports to the Administrator of NASA in which more than a dozen recommendations and suggestions were made. Some of the concerns expressed in the interim reports have been resolved since the reports were presented; others remain at issue. All of the concerns identified in those reports are reflected in the Findings and Recommendations summarized in Section 1.3.

1.1 NASA'S SAFETY POLICY AND PROCESS

NASA policy regarding safety is established by the Administrator; its essence (as stated in NASA Policy Directive 1701.1) is to:

"a. Avoid loss of life, injury of personnel, damage and property loss.

"b. Instill a safety awareness in all NASA employees and contractors.

"c. Assure that an organized and systematic approach is utilized to identify safety hazards and that safety is fully considered from conception to completion of all . . . agency activities.

"d. Review and evaluate plans, systems, and activities related to establishing and meeting safety requirements both by contractors and by NASA installations to ensure that desired objectives are effectively achieved."

Every manager throughout the organization is responsible for systematically identifying risks, hazards, or unsafe situations or practices, and for taking steps to assure adequate safety in the activities and products under his supervision. Out of this broad policy framework are derived the more specific safety requirements that are implemented in successively greater detail down through Headquarters, program, and project organizations at the NASA centers and contractors. The Committee finds that the basic documents setting forth these policies are complete and do establish a firm foundation for the NASA-wide safety program.

Central to NASA's analyses to ensure reliability of the Shuttle system is the Failure Modes and Effects Analysis (FMEA). FMEAs are performed on all STS flight hardware as well as Ground Support Equipment (GSE) which interfaces with flight hardware at the launch sites to identify hardware items that are critical to the performance and safety of the vehicle and the mission, and to identify items that do not meet design requirements. Each possible failure mode is identified and then analyzed to determine the resulting performance of the system and to ascertain the worst-case effect that could result from a failure in that mode. All the identified "critical items" are then categorized according to the worst-case effect of the failure on the crew, the vehicle, and the mission. If the worst-case effect is loss of life or vehicle, the item is categorized as Criticality 1 (1R if there are redundant units, and 1S if it would result from the failure of a piece of ground support equipment). In the same manner, Criticality 2 and 2R are cases where loss of mission could result.

The result of this classification is a "Critical Items List" (CIL) which includes for each item the rationale for its retention on the STS, thus requiring a waiver of the NASA policy against flying with such items present. The retention rationale is the primary input to NASA waiver decisions to fly the Shuttle, exposing the STS and its crew to the risk implicit in the use of the analyzed critical item. The retention rationale is used to justify accepting the design "as is," in the Committee's view; its audits of the NASA review process discovered little emphasis on creative ways to eliminate potential failure modes.

The hazard analysis is another analytical tool used to identify and, if possible, resolve hazardous conditions that could develop while operating and maintaining STS hardware and software. Hazard analyses consider not only the failures identified in the FMEA process, but also other potential threats posed by the environment, crew-machine interfaces, and mission activities. Identified hazards and their causes are analyzed to find ways to eliminate or control the hazard. A hazard is said to be "eliminated" when its source has been removed. A "controlled hazard" is one that has effectively been controlled by a design change, addition of safety or warning devices, procedural changes, or operational constraints. Any hazard that cannot feasibly be eliminated or controlled is termed an "accepted risk."

There are many other analysis and assessment tools used by NASA. This complex mosaic of analysis techniques is intended to provide an all-encompassing approach to ensuring the design reliability and safety of the STS. Some of the techniques, such as the hazard analyses, tend to be "top-down" approaches that examine certain cross-systems causes and effects. Others, such as FMEA/CIL, are narrower "bottom-up" analyses that pursue a specific event to its conclusion but only with respect to the subsystem involved.

In March 1986, soon after the Challenger accident, direction was issued within NASA to reevaluate the FMEAs on all critical items on the STS, "... to affirm the completeness and accuracy of the FMEA/CIL for the current National STS design." Following reevaluation of the FMEA, each Criticality 1 and 1R item, along with any new items, or items for which the reevaluation had led to a change in classification, was to be resubmitted for review and approval of the waiver permitting the item to be flown aboard the STS. Those items not revalidated by the review were required to be redesigned, certified, and qualified for flight. In addition to the FMEA/CIL reevaluation, the directives stipulated that the hazard analyses and a set of special Element Interface Functional Analyses (EIFAs) were also to be reviewed for completeness and accuracy.
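The criticality categories described above can be read as a small decision rule. The following Python fragment is only an illustrative sketch of that rule as summarized in this chapter; the names `WorstCase` and `criticality` are hypothetical, the order in which the 1S and 1R conditions are tested is an assumption, and items whose worst-case effect is neither loss of life or vehicle nor loss of mission are left unclassified because this summary does not define further categories.

```python
from enum import Enum
from typing import Optional


class WorstCase(Enum):
    """Worst-case effect of a failure mode, as determined by the FMEA."""
    LOSS_OF_LIFE_OR_VEHICLE = "loss of life or vehicle"
    LOSS_OF_MISSION = "loss of mission"
    OTHER = "other"  # not categorized in this summary


def criticality(worst_case: WorstCase,
                redundant: bool = False,
                ground_support: bool = False) -> Optional[str]:
    """Return the criticality label for one failure mode.

    Assumption: ground support equipment is tested before redundancy
    when both apply; the summary does not state a precedence.
    """
    if worst_case is WorstCase.LOSS_OF_LIFE_OR_VEHICLE:
        if ground_support:
            return "1S"   # failure of a piece of ground support equipment
        if redundant:
            return "1R"   # redundant units exist
        return "1"
    if worst_case is WorstCase.LOSS_OF_MISSION:
        return "2R" if redundant else "2"
    return None           # outside the categories described here


if __name__ == "__main__":
    # Hypothetical item: loss of vehicle possible, redundant units present.
    print(criticality(WorstCase.LOSS_OF_LIFE_OR_VEHICLE, redundant=True))
```

Note that the rule depends only on the worst-case effect and on redundancy, not on how likely the failure is.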

Since the Challenger mission 51-L accident, a substantial number of engineering changes have been undertaken to improve Shuttle safety prior to resumption of flight. The redesign activity has, for the most part, preceded the FMEA/CIL and hazard analysis reevaluations. However, as the reevaluations proceeded, they disclosed a number of additional items which are being addressed before the next flight.

1.2 THE COMMITTEE'S VIEW

As the Challenger accident made very evident, space flight is not routine. Its risks must be accepted by those who are asked to participate in each flight as well as by those who are responsible to the nation for achieving our goals in space. The Committee believes that the basis for NASA's acceptance of those risks should, as far as possible, stem from rationally derived criteria. This acceptance also should depend very heavily on the quality of the methodology and the degree of objectivity by which the risks are determined, as well as the rigor by which the risks are controlled (i.e., managed).

Very early in the work of the Committee, it became clear that NASA's processes for analyzing failure modes, effects, and hazards could only be understood and evaluated intelligently when viewed as elements of an overall program of risk assessment and risk management. In the Committee's view, any such program should include the following basic elements:

Risk assessment:
• A comprehensive method for identifying potential failure modes and hazards associated with the system.
• A specific, quantitative methodology for identifying and assessing (or estimating) the safety risks of the system.

Risk management:
• A management process by which the safety risks can be brought to levels or values that are acceptable to the final approval authority. Risk management includes establishment of acceptable risk levels; the institution of changes in system design or operational methods to achieve such risk levels; system validation and certification; and system quality assurance.

The basic organizational elements are in place within NASA for assessing and managing risk; however, there is a need for a change in the scope of functions and the way that they are carried out. The Committee believes that the management of the risks of the STS must be the responsibility of line management (i.e., the NSTS Program Manager, the Associate Administrator for Space Flight and, ultimately, the Administrator of NASA). Only this program management, not the safety organizations, can make judicious use of the means available to achieve operational goals while controlling the safety risks at acceptable levels throughout the evolution of the program. The safety organizations at NASA centers and Headquarters are staff organizations; as such, they can and should be responsible for providing assessments of the system's risks. They should also be responsible for assuring that the activities associated with controlling the risks to the specified levels have been carried out and documented. Safety organizations cannot, however, assure safe operation.

Certain shortcomings in process and methodology exist which are discussed in Section 5 and summarized in Section 1.3 below. In particular, there is a fundamental problem in the nature of and the methods used to develop the overall assessments on which NASA line management bases its decisions about how to reduce and control risk in the STS.

Risks in STS operations now are assessed based on subjective judgments and accepted on the basis of qualitative rationales, although many quantitative engineering analyses and test data relevant to risk assessment are available and often are used in arriving at what are finally qualitative, subjective judgements. With such a non-specific (i.e., non-value based) risk acceptance process there is little basis for making objective comparisons of the several major risk categories associated with the STS, nor for carrying out risk evaluations by independent agencies. Neither can one systematically track the efforts to reduce the risk or impact of the various possible failures. Without more objective, quantifiable measures of relative risk it is not clear how NASA can expect to implement a truly effective risk management program. However, the Committee does not wish to suggest that NASA subordinate sound technical judgement to numerical analysis. Such an approach would be, in our opinion, unrewarding and counterproductive.

1.3 FINDINGS AND RECOMMENDATIONS

Following are the major findings of the Committee and the specific recommendations associated with them. The summary findings and recommendations are extracted from Section 5 of the report, which includes a discussion of each one. The subsection numbering here parallels that in Section 5. For example, Subsection 1.3.1 corresponds to Subsection 5.1, 1.3.2 corresponds to 5.2, and 1.3.9.1 corresponds to 5.9.1. In addition, the recommendations are numbered sequentially and identically in both sections. It should be noted that the recommendations are not listed in any priority order.

1.3.1 Critical Items List Retention Rationale Review and Waiver Process

The Committee views the NASA critical items list (CIL) waiver decision making process as being subjective, with little in the way of formal and consistent criteria for approval or rejection of waivers. Waiver decisions appear to be driven almost exclusively by the design-based FMEA/CIL retention rationale, rather than being based on an integrated assessment of all inputs to risk management. The retention rationales appear biased toward proving that the design is "safe," sometimes ignoring significant evidence to the contrary (see Section 5.1).

Although the Safety, Reliability, and Quality Assurance (SR&QA)² organizations of NASA collect, verify, and transmit all data related to FMEA/CIL and hazard analysis results, the Committee has not found an independent, detailed analysis or assessment of the CIL retention rationale which considers all inputs to the risk assessment process.

Recommendations (1):

The Committee recommends that NASA establish an integrated review process which provides a comprehensive risk assessment and an independent evaluation of the rationale justifying the retention of Criticality 1 and 1R items. This integrated review should include detailed consideration of the results of hazard analyses and all other inputs to the risk assessment process, in addition to the FMEA/CIL retention rationale. Further, the review process should assure that the waivers and supporting analyses fully reflect current data and designs. Finally, NASA should develop formal, objective criteria for approving or rejecting proposed critical item waivers.

² As of September 1987, the NASA Headquarters organization is called Safety, Reliability, Maintainability, and Quality Assurance (SRM&QA), while the similar organizations at the NASA centers are still named SR&QA. In this report, SR&QA also is used to refer generically to this function.

1.3.2 Critical Items List Prioritization and Disposition

At present, in NASA instructions all Criticality 1 and 1R items are formally treated equally, even though many differ substantially from each other in terms of the probability of failure or malperformance, and in terms of the potential for the worst-case effects postulated in the FMEA to be seen if the particular failure occurs.

The large number of Criticality 1 and 1R items at the time of the 51-L accident has since been substantially increased due to changes in ground rules for classification and the complete reevaluation of the entire STS.

The Committee believes that giving equal management attention to all Criticality 1 and 1R potential failures could be detrimental to safety if, as is the case, some are extremely unlikely to occur, or if the probability is very low that the postulated worst-case consequences of the failures will result. Treating all such items equally will necessarily detract from the attention senior management can give to the most likely and most threatening failure modes.

Recommendations (2):

The Committee recommends that the formal criteria for approving waivers include the probability of occurrence and probability that the worst-case failures will result. We further recommend that NASA establish priorities now among Criticality 1 and 1R items, taking care not to use ambiguous measures of risk and probability. NASA should also modify the definitions of criticality in terms of the probability of failure and probability of worst-case effects. Finally, we recommend that NASA Level I management pay special attention to those items identified as being of highest priority, along with the rationale that produced the priority rating. Responsibility for attending to lower-priority items within the present Criticality 1 and 1R categories, when reclassified, should be distributed to Levels II and III for detailed evaluation and decision.
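As one way to picture the prioritization that Recommendation (2) calls for, the Python sketch below ranks critical items by combining the two probabilities the recommendation names. The combination used here, the simple product of the probability of failure and the probability that the worst-case effect follows from that failure, is an assumption made for illustration; the report does not prescribe a formula. The class name `CriticalItem`, the function `prioritize`, and all item names and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class CriticalItem:
    """A hypothetical critical item with two estimated probabilities."""
    name: str
    p_failure: float      # probability that the failure mode occurs
    p_worst_case: float   # probability of the worst-case effect, given failure

    def __post_init__(self) -> None:
        # Reject values that cannot be probabilities.
        for p in (self.p_failure, self.p_worst_case):
            if not 0.0 <= p <= 1.0:
                raise ValueError("probabilities must lie in [0, 1]")

    @property
    def risk_score(self) -> float:
        # Assumed measure: probability that the worst case actually results.
        return self.p_failure * self.p_worst_case


def prioritize(items: Iterable[CriticalItem]) -> List[CriticalItem]:
    """Return items ordered from highest to lowest assumed risk score."""
    return sorted(items, key=lambda item: item.risk_score, reverse=True)


if __name__ == "__main__":
    # Illustrative values only; these are not data from the report.
    items = [
        CriticalItem("item B", p_failure=1e-2, p_worst_case=0.01),
        CriticalItem("item C", p_failure=1e-5, p_worst_case=1.0),
        CriticalItem("item A", p_failure=1e-3, p_worst_case=0.5),
    ]
    for item in prioritize(items):
        print(item.name, item.risk_score)
```

Under this assumed measure, an item that fails often but rarely leads to the worst case can rank below one that fails less often but is more likely to produce it, which is the kind of distinction the equal treatment of all Criticality 1 and 1R items cannot express.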

... date. Data bases derived from STS failures, anomalies, and flight and test results, and the associated analysis techniques, should be systematically expanded to support probabilistic risk assessment, trend analyses, and other quantitative analyses relating to reliability and safety. Although the Committee believes that probabilistic risk assessment approaches will greatly improve NASA's risk assessment process, it recognizes that these approaches should not substitute for good engineering and quality control practices in design, development, test, manufacturing, and operations, all of which must continue to receive high priority emphasis by NASA and its contractors. The Committee further recommends that NASA build up its capability in the statistical sciences to provide improved analytical inputs to decision making.

1.3.7 The Need for Integrated Space Transportation System Engineering Analysis in Support of Risk Management

NASA safety-related analyses tend to focus primarily on single-event, worst-case failures to the relative exclusion of possible multiple and synergistic failures in different subsystems or elements of the STS. In addition, the connection between the various analyses appears tenuous. There does not appear to be an adequate integrated-system view of the entire STS.

Recommendation (7):

A "top-down" integrated system engineering analysis, including a system safety analysis, that views the sum of the STS elements as a single system should be performed to help identify any gaps that may exist among the various "bottom-up" analyses centered at the subsystem and element levels.

1.3.8 Independence of the Space Transportation System Certification and Software Validation and Verification Program

In general, hardware certification and verification, and software validation and verification³ in STS are managed and conducted primarily by the same organizational elements responsible for the design and fabrication of the units. Thus, the independence of the certification, validation, and verification processes is questionable. For example:

• The contractor that builds the Orbiters (Rockwell International, STS Division) is also responsible for preparing the documentation and performing the work involved in certification, but does not answer to an entity independent of the NSTS Program with regard to the certification function.
• At Marshall Space Flight Center (MSFC), the Engineering Directorate has the prime responsibility for design requirements for the propulsion elements of STS and also has responsibility for the review and approval of their certification. The Program Office is responsible for the design and development phase as well as for performing the certification activities.
• At the Johnson Space Center (JSC), prime responsibility for design requirements, design and development, and certification for the Orbiter all rest with the Program Office, supported by the Engineering and Operations Directorates of the Center.
• "Independent" validation and verification (IV&V) of software is carried out by the same contractor (IBM) that produces the STS software, with some checks being made by the Johnson Space Center (JSC).

Recommendation (8):

Responsibility for approval of hardware certification and software IV&V should be vested in entities separate from the NSTS Program structure and the centers directly involved in STS development and operation. However, these organizations should continue to conduct activities supporting certification and IV&V.

³ See Appendix A for definition of these terms.

1.3.9 Operational Issues

1.3.9.1 Launch Commit Criteria Waiver Policy

An average of two Launch Commit Criteria (LCCs) are waived by NASA in the course of each launch. The Committee questions the validity of an operational procedure that "institutionalizes" waivers by routinely permitting established criteria to be violated.

Recommendation (9a):

The Committee recommends that NASA establish a list of mandatory LCCs which may NOT be waived by anyone. This should comprise the bulk of the LCCs. A limited number of criteria would be separately listed, for special cases, together with a discussion of the circumstances under which they may be waived and who may make the waiver decision.

1.3.9.2 Human Factors as a Contributor to Risk

Human factors, which are considered in some of the STS hazard analyses, do not appear to be taken into account as the cause of failure modes in the FMEAs. Since the FMEA is one of the principal safety tools used in the evaluation of the STS design, the Committee believes that the STS design process should explicitly consider and minimize the potential contribution of humans to the initiation of the defined failure modes.

Recommendation (9b):

The Committee recommends that the NASA FMEA include human factors among the recognized sources of potential causes of failure modes. This step would provide another valid link between the FMEA and the hazard analysis, which are now, in our view, too tenuously connected.

1.3.9.3 Cannibalization of Spare Parts

By the time of the Challenger accident, "cannibalization," the removal of parts at the Kennedy Space Center (KSC) from one operational STS element to fulfill spares requirements in another, had become a prevalent feature of STS logistics, thus introducing a variety of failure potentials associated with human error. Cannibalization is not evaluated as a producer of potential failure in either the hazard analysis (where it would be most appropriate) or the FMEA.

Recommendations (9c):

The Committee recommends that NASA maintain its current intense attention toward reducing cannibalization of parts to an acceptable level. We further recommend that adequate funds for the procurement and repair of spare parts be made available by NASA to ensure that cannibalization is a rare requirement. Finally, we recommend that NASA include cannibalization, with its attendant removal and replacement operations, as a potential producer of failure in the integrated risk assessment recommended earlier (Section 1.3.1).

1.3.10 Other Weaknesses in Risk Assessment and Management

1.3.10.1 The Apparent Reliance on Boards and Panels for Decision Making

The multilayered system of boards and panels in every aspect of the STS may lead individuals to defer to the anonymity of the process and not focus closely enough on their individual responsibilities in the decision chain. The sheer number of STS-related boards and panels seems to produce a mindset of "collective responsibility."

Recommendation (10a):

The Committee recommends that the Administrator of NASA periodically remind all NASA personnel that boards and panels are advisory in nature. He should specify the individuals in NASA, by name and position, who are responsible for making final decisions while considering the advice of each panel and board. NASA management should also see to it that each individual involved in the NSTS Program is completely aware of his/her responsibilities and authority for decision making.

1.3.10.2 Adequacy of Orbiter Structural Safety Margins

The primary structure of the STS has been excluded, by definition, from the FMEA/CIL process, based on the belief that there is an adequate positive margin of safety. However, the Committee questions whether operating structural safety margins have actually been proven adequate.

Completion of the Model 6.0 loads study and the reevaluation of margins of safety based on these loads will significantly improve NASA's grasp of actual operating margins of safety.

Recommendations (10b):

The Committee recommends that NASA place a high priority on completion of the Model 6.0 loads, the reevaluation of safety margins for these loads, and the early verification and continued monitoring of the Model 6.0 loads by permanently instrumenting and calibrating at least the next full scale STS vehicle to fly. We further recommend that NASA complete and implement a comprehensive plan for conducting periodic inspection and maintenance of the structure of the Orbiters throughout the service life of each vehicle.

1.3.10.3 Software Issues

• NASA FMEAs do not assess software as a possible cause of failure modes.
• There is little involvement of JSC Safety, Reliability, and Quality Assurance in software reviews, resulting in little independent quality assurance for software.
• A large amount of data, much of it flight specific, must be loaded for each Shuttle mission but it is not subjected to validation as rigorous as that for the software.

Recommendations (10c):

The Committee recommends that NASA:
• explore the feasibility of performing FMEAs on software, including the efficacy of identifying and predicting fault and error modes;
• request JSC SR&QA to provide periodic review and oversight of software from a quality assurance point of view;
• provide for validation of input data in a manner similar to software validation and verification.

1.3.10.4 Differences in Procedures Among NASA Centers

Differences in the procedures being used by the main NASA centers involved in the NSTS Program may reflect an imbalance between the authority of the centers and that of the NSTS Program Office. The Committee is concerned that such an imbalance can lead to serious problems in large programs where two or more centers have major roles in what must be a tightly integrated program, such as the NSTS and Space Station. Without strong, central program direction and integration, the success and safety of these complex programs can be placed in jeopardy.

Recommendation (10d):

The Administrator should ensure that strong, central program direction and integration of all aspects of the STS are maintained via the NSTS Program Office.

1.3.10.5 Use of Non-Destructive Evaluation Techniques

Non-destructive evaluation (NDE) tests on the Solid Rocket Motor (SRM) are performed at the manufacturing plant. Subsequent transportation and assembly introduce a risk of debonding and other damage which may not be apparent upon visual inspection. No NDE is done on the SRMs in the "stacked" configuration at the launch facility. New NDE techniques now being developed have potential applicability to the STS.

Recommendation (10e):

The Committee recommends that NASA apply all practicable NDE techniques to the SRM at the launch facility, at the highest possible level of assembly (e.g., SRMs in the "stacked" configuration), and emphasize development of improved NDE methods.

1.3.11 Focus on Risk Management

• The current safety assessment processes used by NASA do not establish objectively the levels of the various risks associated with the failure modes and hazards.
• It is not reasonable to expect that NASA management or its panels and boards can provide their own detailed assessments of the risks associated with failure modes and hazards presented to them for acceptance.
• Validation and certification test programs are not planned or evaluated as quantitative inputs to safety risk assessments.
• Neither are operating conditions and environmental constraints which may control the safety risks adequately defined and evaluated.

In the Committee's view, the lack of objective, measurable assessments in the above areas hinders the implementation of an effective risk management program, including the reduction or elimination of risks.

Recommendations (11):

The Committee recommends that NASA consider establishing a focused agency-wide Systems Safety Engineering (SSE) function, at both Headquarters and the centers, which would:
• be structured so as to be integrally involved in the entire set of design, development, validation, qualification, and certification activities;
• provide a full systems approach to the continuous identification of safety risks (not just failure modes and hazards) and the objective (quantitative) evaluation of such safety risks;
• provide the output of this function to the NASA Program Directors in support of their risk management; and
• support the Program Directors by providing assurance that their systems are ready for final safety certification to the risk levels established by the NASA Administrator.

The Committee also recommends that the STS risk management program, based in part on the definition of the potential to reduce the level of risk developed by the system safety risk assessment, include a concerted effort to remove or reduce the risks.

1.4 CLOSING REMARKS

Although this report and its recommendations are directed to the NSTS Program, most of them are of broader applicability. It would be wise to consider the lessons learned here when structuring a risk assessment and management system for other programs which have similar attributes, such as the Space Station. The safety of other large systems involving highly complex technology, and requiring major participation by several NASA centers and prime contractors, could benefit from an integrated risk assessment and management program based on the current NASA procedures supplemented by those recommended in this report. For any new program, such as the Space Station, there is the opportunity to structure an optimum risk assessment and management program at the outset by assembling those elements of risk assessment and management which will be most effective in establishing, monitoring, and controlling safety risks to accepted levels. (See Section 6.)