5 National Space Transportation System Risk Assessment and Risk Management: Discussion and Recommendations

5.1. CRITICAL ITEMS LIST RETENTION RATIONALE REVIEW AND WAIVER PROCESS

The Committee views the NASA critical items list (CIL) waiver decision making process as being subjective, with little in the way of formal and consistent criteria for approval or rejection of waivers. Waiver decisions appear to be driven almost exclusively by the design-based FMEA/CIL retention rationale, rather than being based on an integrated assessment of all inputs to risk management. The retention rationales appear biased toward proving that the design is "safe," sometimes ignoring significant evidence to the contrary.

Although the Safety, Reliability, and Quality Assurance (SR&QA) organizations of NASA collect, verify, and transmit all data related to FMEA/CIL and hazard analysis results, the Committee has not found an independent, detailed analysis or assessment of the CIL retention rationale which considers all inputs to the risk assessment process.

As set forth in the NASA documents identified in Section 3.1, both the performance of the Failure Modes and Effects Analysis (FMEA) and the identification of critical items are intended to be carried out under the aegis of the reliability function. In principle, the FMEA should be both a design tool to provide an impetus for design change, and a tool for the evaluation of the final configuration in order to define the necessary control points on the hardware. The identified critical items would require supporting retention rationale and waivers as appropriate in order to be included in the overall as-flown system configuration. How this retention rationale was to be generated, who developed it, and who evaluated it against what safety criteria became crucial questions for the Committee's review of the whole process.

According to prescribed procedures, the hazard analyses being performed by the safety function of SR&QA, and the FMEA and CIL identification performed by the reliability function, were to come together in the generation of Mission Safety Assessment (MSA) reports which would contain analyses and justification of the retention rationale for the critical items and their associated "hazards," as well as a safety-risk assessment of the resulting units, subsystems, and systems. The hazard analysis and Mission Safety Assessment parts of this overall safety and risk assessment process, as it was supposed to be done prior to 1986, are shown in Figure 5-1, obtained from JSC's SR&QA.

As Figure 5-1 indicates, according to specified NASA procedure the CIL retention rationale is to be used as one of many inputs to the more comprehensive hazard analysis. In reality, however, the hazard analysis is often simply a derivative of the CIL and its retention rationale, and is not used as a major basis for waiver decisions. Examination by the Committee showed that often these retention rationales were simply discussions of the hardware's specifications, design, and testing. They were generated primarily by the functional development engineers responsible for the design.

[Figure 5-1: the NASA safety analysis, hazard analysis, and Mission Safety Assessment process as prescribed prior to 1986 (obtained from JSC SR&QA). Figure not reproduced in this text.]

They are intended to be justifications, and do not, in our view, provide a true assessment of the risk of the hazards. Sometimes the rationale appears to be simply a collection of judgments that a design should be safe, emphasizing positive evidence at the expense of the negative, and thus does not give a balanced picture of the risk involved. For example, the CIL retention rationale of December 1982 for the Solid Rocket Motor (SRM) indicated in support of retention that: there had been no failures in three qualification, five development, and ten flight motors; there had been no leakage in eight static firings and five STS flights; 1076 Titan III joints (presumably of similar design) were tested successfully; etc. Missing from the retention rationale was, among other points, any discussion of the dissimilarities between the SRM and Titan III (e.g., insulation design and combustion pressure on the O-ring); the O-ring erosion observed in the Titan III program and on the second STS flight; a failure during an SRM burst test; and, since the rationale was not updated, all of the O-ring anomalies seen after December 1982.

Furthermore, in many cases we reviewed:

- No specific methodology or criteria are established against which these justifications can be measured.
- The true margins against the failure modes often are not defined or explicitly validated.
- The probability of the failure mode is never established quantitatively.
- Design "fixes" are accepted without being analyzed and compared with the configuration they are replacing on the basis of relative risk.

The point is worth reiterating: the retention rationale is used to justify accepting the design "as is"; Committee audits of the review process discovered little emphasis on creative ways to eliminate potential failure modes.

Since 51-L, there has been a major increase in the attention and resources given to STS SR&QA and risk assessment and management functions at all levels of NASA and its contractors. In 1986, NASA appointed an Associate Administrator at Headquarters for Safety, Reliability, Maintainability, and Quality Assurance (SRM&QA) and charged him with establishing a NASA-wide safety and risk management program. To implement this program, policy directives are being developed relating to various procedures and operational requirements. Specific instructions and methodologies to be used in the conduct of various analyses and assessments, such as hazard analyses, are being developed. Independent institutional assessments and audits will be made of SR&QA activities and technical effectiveness at each NASA center.

Some important elements of this revamped NASA safety program, including hazard analysis and mission safety assessment, are depicted in Figure 5-2, which was obtained from the JSC SR&QA organization in May 1987. Several things shown in the figure should be noted. First, there is now a specific new set of NSTS instructions to all contractors and NASA organizations for conducting hazard analyses, and for preparing FMEAs and CILs for the NSTS (these new instructions affect the activities in the boxes in Figure 5-2 marked with an asterisk). Second, it can be seen that the FMEA/CIL documents are intended to be one of many inputs into the hazard analysis and Hazard Report, which in turn are shown as an input into the Mission Safety Assessment.
However, since (as discussed in Section 4.2) the Hazard Reports do not provide a comprehensive risk assessment, nor are they even required to be an independent evaluation of the retention rationale stated in the CILs, the Committee believes that NASA plans, at least for the near term, to continue using the retention rationale of the CILs directly and individually as the basis for Criticality 1 and 1R waiver justifications to Levels II and I. We have indicated this by adding the Criticality 1 and 1R waiver path within the dashed lines on the left side of Figure 5-2. The current plan is to take the critical item waiver requests to the PRCB and Level I via a data package prepared by JSC SR&QA. It is our impression, however, that most of the arguments in this data package will still basically be those contained in the original CIL retention rationale. Thus, we see too little in the way of an independent detailed analysis, critique, or assessment of the risk inherent in Engineering's rationale.

Since mid-1986, NASA and its contractors have been performing a massive rework of all STS program FMEAs, updating the resulting CILs, and reviewing all prior HAs. This new FMEA/CIL effort has had value in identifying new failure modes that were missed earlier or introduced through past changes, and those resulting from new changes made mandatory before next flight.

[Figure 5-2: NASA JSC safety analysis, hazard reports, and safety assessment process in 1987, as modified by the Committee (adapted from NASA JSC SR&QA). Dashed boxes were added by the Committee; asterisks mark new procedures added since 51-L. Figure not reproduced in this text.]

However, the new NSTS instructions for preparing FMEA/CILs (NSTS 22206) have also resulted in a large increase in the number of Criticality 1 and 1R items. The Committee believes this new complexity will pose additional severe problems for both the mechanics and credibility of the CIL and waiver processes. The strong dependence on the CIL retention rationales in waiver decisions makes it critical that they be comprehensive and up to date.

It is not clear to the Committee whether, in the pre-51-L environment, changes in the STS configuration or the operational experience base led directly and surely to review and appropriate updating of the relevant CIL retention rationale. In the wake of the 51-L accident, the NSTS program issued a document (NSTS 22206) which is intended to strengthen the process for updating the retention rationale. Once a retention rationale has been accepted and a waiver granted for a critical item, any changes to the item itself, the FMEA, or the CIL that could affect the retention rationale mean that the CIL must be resubmitted to the Level II/I PRCB for its approval (NSTS 22206, p. 2-7, para. 2.2.6). Any change, whether it be to the test environment, level, procedures, methods, or frequency, is to be reflected in changes to the retention rationale. If crew procedures are changed to reduce risk, corresponding changes are also to be made in the retention rationale.

The question is whether this updating is conducted regularly and in a consistently rigorous fashion. Although this policy is new and may not yet have been fully imposed in all quarters, NASA and contractor personnel interviewed by the Committee seemed variously uncertain about or unaware of these requirements and how they are met. Updating the retention rationale seems to many to be considered a routine bookkeeping chore, of secondary importance, yet these rationales are the primary basis for granting waivers.

During its audit the Committee developed a concern that the FMEA and associated retention rationale on a given critical item may sometimes fail to provide data in various important categories of information, such as the effects of environmental parameters. The lack of data in a certain case may or may not be significant with respect to the threat that item represents. Yet the absence of such data, even though it resulted in uncertainty, in the past has sometimes had the effect of bolstering the rationale for retention and providing unwarranted confidence in readiness reviews. This problem was especially in evidence with Mission 51-L. Data suggesting that temperature was a factor in the erosion of the O-rings did exist, but (according to the Rogers Commission) the relevant analyses apparently were considered to be inconclusive by those responsible, and these data did not appear in the retention rationale. Thus, the rationale implied that there were no data to suggest that temperature was a problem. Strengthening and closing the problem reporting loop since the accident may well reduce the likelihood of similar future occurrences. Still, we note that the "negative answer" indicates uncertainty about the issue at hand. If the uncertainty is crucial to the decision process, then it implies the need for more experiments, tests, or analyses to reduce the uncertainty. (Appendix E includes an analysis of the O-ring temperature effect and the uncertainty implied by extrapolation to low temperatures.)

Thus, the Committee's central concerns here are the reliance on and quality of the retention rationale, and the fact that we can perceive no documented, objective criteria for approving or rejecting proposed waivers. CIL waiver decision making appears to be subjective, with no consistent, formal basis for approval or rejection of waivers. All items are considered and discussed at length during the CCB and PRCB reviews. It appears that, if no action item is generated as a result of the review, the critical item waiver is approved.
There was no formal "approved or disapproved" step in meetings audited by the Committee, although we are informed that such approvals do appear in the minutes of the meetings. NASA managers emphasize that Level III engineers and their "Level IV" contractors are accorded a high level of responsibility and accountability throughout the program, and that their opinions and analyses are the real bases for making retention decisions; these engineers bear the burden of proving that the rationale is strong enough to justify retention and waiver of the item.

However, the Committee believes that engineering judgment on these matters is not enough. Such judgment is crucial, but it is often too susceptible to vagaries of attention, knowledge, opinion, and extraneous pressures to be the sole foundation for decision making. We are concerned that, for all the reasons discussed above, without professional, detailed evaluation against specific criteria for reducing risk (not just review by panels and boards), the retention rationales can be misleading or even incorrect regarding the true causes and probabilities of the failure modes for which retention waivers are being requested (see discussion of probabilistic risk assessment in Section 5.6).

Recommendations (1):

The Committee recommends that NASA establish an integrated review process which provides a comprehensive risk assessment and an independent evaluation of the rationale justifying the retention of Criticality 1/1R and 2/2R items. This integrated review should include detailed consideration of the results of hazard analyses and all other inputs to the risk assessment process, in addition to the FMEA/CIL retention rationale. Further, the review process should assure that the waivers and supporting analyses fully reflect current data and designs. Finally, NASA should develop formal, objective criteria for approving or rejecting critical item waivers.

5.2 CRITICAL ITEMS LIST PRIORITIZATION AND DISPOSITION

At present, in NASA instructions all Criticality 1 and 1R items are formally treated equally, even though many differ substantially from each other in terms of the probability of failure or malperformance, and in terms of the potential for the worst-case effects postulated in the FMEA to be seen if the particular failure occurs. The large number of Criticality 1 and 1R items at the time of the 51-L accident has since been substantially increased due to changes in ground rules for classification and the complete reevaluation of the entire STS. The Committee believes that giving equal management attention to all Criticality 1 and 1R potential failures could be detrimental to safety if, as is the case, some are extremely unlikely to occur, or if the probability is very low that the postulated worst-case consequences of the failures will result. Treating all such items equally will necessarily detract from the attention senior management can give to the most likely and most threatening failure modes.

Critical items in the Shuttle system are categorized according to the consequences of worst-case failure of that item. However, it has been the case that within each criticality category no further ranking is formally made. In practice, managers do sometimes discriminate within a category, e.g., in their decisions regarding those STS items which should be fixed prior to next flight. Prior to the 51-L accident there were already 2369 Criticality 1 and 1R items (the most critical) present in the Shuttle system. There has been a substantial increase in the number of such items, now estimated by NASA to be 4686, of which 2148 have been approved by the PRCB (Director, JSC SR&QA, personal communication, November 10, 1987). This increase resulted from the reevaluation of the entire Space Shuttle system and the new ground rules specified for the preparation of FMEAs, e.g., the carrying of analyses down to the individual component level (even where multiple, identical components are involved) and the inclusion of pressure vessels which were formerly excluded (see Section 3.5.2). To take just one example, the number of Criticality 1 and 1R items in the SSME turbomachinery rose from 8 to 67 under the new ground rules.

In view of this problem, NASA is now taking steps to prioritize the most critical items and will reevaluate the current scheme for defining levels of criticality. Initially, the reassessment process seemed to the Committee to be too heavily focused on Level I. The presence of a very large number of Criticality 1 and 1R items, even admitting that many are clustered with identical items, obviously places a heavy demand on the time and attention of key NASA decision makers and could prevent their penetrating deeply enough into the analyses surrounding each item to make a valid decision on all of them.
We were concerned not only about the workload placed on Level I management, but also about the danger that crucial technical details might be lost or obscured as the rationale for retention was presented at successively higher levels. Although the same information is presented at the Level II and Level I PRCBs, it seemed entirely possible that technical debates occurring at lower levels might not be adequately relayed to Level I. A post-51-L organizational change that shifted the Level II NSTS Program Director at JSC to Level I at Headquarters has alleviated these concerns to some extent. NASA recognized that the waiver decision-making flow was not ideal, especially from Level II to Level I. Consequently, the Level I NSTS Director (who also chairs the Level I PRCB) now participates in the Level II reviews as a basis for sign-off at Level I. Thus, there is now a more direct "hand-off" of concerns and rationales from Level III to Level I, via Level II. Nevertheless, the process still places a heavy workload on Level I, and there is still a danger that important technical information might be lost in transmission.

The organizational change streamlined the waiver decision-making process, but it did not help in handling the large number of Criticality 1 and 1R items. Many of these items differ substantially from each other in terms of the probability of failure or malperformance, and in terms of the possibility that the worst-case effects postulated in the FMEA will be seen in the event the particular failure does occur. (In this connection it might be noted that, prior to 51-L, 56 Criticality 1 failures occurred on the Orbiter during flight without any of the postulated worst-case effects resulting.) Thus, the items vary considerably in their potential impact on Shuttle operational safety, i.e., on risk.

Early in its audit the Committee began urging NASA to find a way to prioritize the Criticality 1 and 1R items (see Appendix C, first interim report). NASA managers tended to assert that, since all Criticality 1 and 1R items are (by definition) equally catastrophic in their consequences, all should be treated equally; and, indeed, we saw evidence in our audits that they were handled with equal attention. But it is the position of the Committee that giving equal management attention to all such items could be detrimental to safety if (as is the case) some are extremely unlikely to fail, or the probability is very low that the postulated worst-case consequences of the failures will result. The most likely and most threatening failure modes merit the most attention. It is illogical to dissociate the probability of an event or its consequences from decisions about the management of risk.

For example, in the development of a probabilistic risk assessment for a modern nuclear power plant, fault tree and event tree analyses typically identify several million potential sequences of events (including multiple independent failures and cascading failures) that can lead to core melt-down. However, only 20 to 50 of these sequences contribute significantly to the risk, with five to ten of them contributing 90% of the risk. These particular sequences are exhaustively analyzed to identify ways to substantially reduce the overall risk.

A secondary consideration of the Committee was the possible impact of the disclosure that, as the resumption of Shuttle operations nears, there are more Criticality 1 and 1R items (with all of them being waived) than there were before the accident. That perception would not be justified by, and would not fairly reflect, the real strides in system safety that have been made since 51-L.

Responding to suggestions on the part of the Committee, NASA developed and tested a number of techniques that could be used to prioritize the CIL on the basis of the relative risk each item represents. One such scheme, termed the Critical Item Risk Assessment (CIRA) procedure, was selected, and instructions for its implementation have now been promulgated throughout the NSTS program (NSTS 22491, June 19, 1987).

The CIRA procedure is currently qualitative in nature, although it employs reliability and test data to some extent. It is based instead on judgments about the degree of threat inherent in different risk factors. The Committee is concerned about the potential negative impact on the CIRA of ambiguous measures of risk and probability. However, the technique does lend itself to the incorporation of more rigorous quantitative measures of risk and probability of occurrence as these measures are developed for use within NASA. (See Appendix E for a discussion of CIRA and one approach to quantitative measures suggested by the Committee.)
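The dominant-contributor logic behind both the nuclear PRA example and a risk-based ranking of critical items can be made concrete with a small calculation. The sketch below is illustrative only: the sequence names, frequencies, and consequence weights are invented for demonstration and are not drawn from any NASA or nuclear-industry assessment.

```python
# Illustrative ranking of accident sequences (or critical items) by risk
# contribution. All numbers are invented; a real PRA would derive
# frequencies and consequences from fault tree and event tree quantification.

sequences = {
    # name: (estimated frequency per mission, relative consequence weight)
    "sequence_A": (3.0e-4, 1.0),
    "sequence_B": (1.0e-4, 1.0),
    "sequence_C": (5.0e-5, 0.8),
    "sequence_D": (2.0e-5, 0.5),
    "sequence_E": (1.0e-6, 1.0),
    "sequence_F": (5.0e-7, 0.3),
}

# Risk contribution of each sequence = frequency x consequence weight.
risks = {name: freq * cons for name, (freq, cons) in sequences.items()}
total_risk = sum(risks.values())

# Rank from largest to smallest contribution and report the few sequences
# that together account for roughly 90% of the total risk.
cumulative = 0.0
print(f"{'sequence':12s} {'risk':>10s} {'share':>8s} {'cumulative':>11s}")
for name, risk in sorted(risks.items(), key=lambda item: item[1], reverse=True):
    cumulative += risk
    print(f"{name:12s} {risk:10.2e} {risk/total_risk:8.1%} {cumulative/total_risk:11.1%}")
    if cumulative / total_risk >= 0.90:
        break  # remaining sequences contribute little to the overall risk
```

A ranking of this kind over the full set of items or sequences is what lets reviewers concentrate detailed analysis, and senior management attention, on the handful of dominant contributors rather than on every postulated failure equally.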
Current plans for the implementation of CIRA, spelled out by the NSTS Deputy Director (Program) in a memorandum dated July 21, 1987, are for STS project managers to prioritize the Criticality 1, 1R, and 1S items in each project after completing the FMEA/CIL reevaluation and presenting the CIL at the Level III CCB. By two weeks before the Design Certification Review, each project manager will provide the NSTS Deputy Director (Program) with a list of "the 20 items in his project that represent the greatest risk to the program." The Deputy Director will then compile and distribute a report. This assessment effort will run parallel to, and may not actually affect, the preparations for STS-26 (the next scheduled Shuttle flight). However, "an alternate course of action" may be chosen for subsequent missions. The Committee views this implementation procedure with concern. It does not appear to reflect a serious concern on the part of the NSTS Program for the need to prioritize the CIL by assessing relative risks.

Recommendations (2):

The Committee recommends that the formal criteria for approving waivers include the probability of occurrence and the probability that the worst-case failures will result. We further recommend that NASA establish priorities now among Criticality 1 and 1R items, taking care not to use ambiguous measures of risk and probability.

NASA should also modify the definitions of criticality in terms of the probability of failure and probability of worst-case effects. Finally, we recommend that NASA Level I management pay special attention to those items identified as being of highest priority, along with the rationale that produced the priority rating. Responsibility for attending to lower-priority items within the present Criticality 1 and 1R categories, when reclassified, should be distributed to Levels II and III for detailed evaluation and decision.

5.3. HAZARD ANALYSIS AND MISSION SAFETY ASSESSMENT

NASA hazard analyses currently do not address the relative probabilities of a particular hazardous condition arising from failure modes, human errors, or external situations. The hazard analysis and the mission safety assessment do not: address the relative probabilities of the various consequences which may result from hazardous conditions; provide an independent evaluation of the retention rationales stated in the input CILs; or provide an overall risk assessment on which to base the acceptance and control of residual hazards.

Hazard analysis (HA) is intended to be a key part of NASA's safety and risk management process. Because it considers hazardous conditions, whatever their source, it is a top-down analysis that should encompass the FMEA and other bottom-up analyses and cover the safety gaps that these other analyses might leave. In reality, however, the HA has not played the central role it was designed to play. Instead, the main focus has been on the FMEA and its corresponding CIL retention rationale. These are design-based analyses, prepared by the project engineering staff. (See Section 5.1.)

The Committee's audit of the FMEA/CIL reevaluation and hazard analysis review produced, at first, a somewhat confusing and contradictory set of perceptions about the relationships between these safety analyses and the nature of the overall risk assessment and management process of which they are a part. Gradually, it became clear that there were differences between the officially prescribed process and the real process, as well as differences in the way the process is perceived by various NASA personnel, depending on their function and point of view. Beyond that, there were also differences among the NASA centers in the implementation at the detail level.

Figure 5-1 (shown earlier), which was prepared by the Safety Division at JSC, depicts fairly accurately the process, as the Committee has come to understand it, that was prescribed by NASA policy at the time of the Challenger accident. Here, the HA is clearly an important element, buttressed by a number of complementary analyses including the FMEA/CIL. The ultimate product of the safety analysis is the Mission Safety Assessment (MSA), feeding into the deliberations of the various engineering and readiness review boards. Figure 5-3, also prepared by the Safety Division at JSC, shows the process from the perspective of that Division, focusing on the HA as the central activity. Note that the FMEA/CIL is listed as one of many inputs to the hazard analysis.

The actual process appears to be quite different from the one suggested by the preceding two figures. During the latter part of 1986 and the first few months of 1987, our audit led to the impression that, although some of the FMEA/CILs were inputs into the HA function, the real risk acceptance process within NASA operated essentially as shown in Figure 5-4 (obtained from JSC).
One can see from the diagram that the "Hazard Analysis As Required" is a dead-end box, with inputs but no output with respect to waiver approval decisions. Our impression was supported by subsystem project managers, engineers, and their functional management at JSC. Many of them believed that the CIL path shown in Figure 5-4 was the actual approval route for retention of designs with Criticality 1 and 1R failure modes.

A key problem, in our view, is that the risk assessment shown in the box entitled "Retention Rationale and Risk Assessment" was not really an independent assessment of the risk levels by professional system safety engineers; such individuals (and they are few in number within NASA) were "left out of the loop." Neither did the assessment contain an evaluation of how system hazards resulting from critical item failure modes would be controlled. In practice, in most cases reviewed by the Committee, the retention rationales written on the CIL forms were simply transferred to the hazard analysis reports and became the basis for final acceptance of residual hazards, and for decision-making at Flight Readiness Reviews (FRRs).

[Figure 5-3: the hazard analysis process from the perspective of the JSC Safety Division, with the FMEA/CIL shown as one of many inputs to the hazard analysis (prepared by the Safety Division at JSC). Figure not reproduced in this text.]

[Figure 5-4: the risk acceptance process as it actually operated, in the Committee's understanding, with the "Hazard Analysis As Required" box receiving inputs but providing no output to waiver approval decisions (obtained from JSC). Figure not reproduced in this text.]

NASA does not use the HAs and (in turn) the MSAs as the basis for the Criticality 1 and 1R waivers. In fact, HAs for some important subsystems were not updated for years at a time even though design changes had occurred or dangerous failures were experienced in subsystem hardware. (An example is the 17-inch disconnect valves between the ET and Orbiter.) The Committee's audit showed that standards and detailed instructions for the conduct of HAs were not found to be consistent throughout the STS program; NSTS 22254 was issued to correct that problem.

In summary, the Committee found in its review of the HA process that:

1. HAs were done for only the largest subsystems of the STS; they addressed certain overlays of hazards but were not traceable to all failures in units within the subsystems.
2. HAs were not done routinely for each major subsystem.
3. The HA assumed worst-case consequences and simply categorized hazard levels (catastrophic or critical) based on whether there was time for counter-actions.
4. The HA process called for an independent evaluation of the HA results. Analyses of catastrophic and critical hazards were to be verified using risk assessment techniques. However, the HAs did not address the relative probability of occurrence of various failures, based on actual flight and test information, nor did they evaluate the validity of the CIL retention rationale against any formal set of criteria.

We found that many engineering personnel, functional managers, and some subsystem managers were unaware of what tasks must be done to complete the hazard analysis, did not know whether they had actually been done, and did not contribute to them. Some, in fact, believed that HAs were just an exercise done by reliability and/or safety people and that they were redundant to the FMEA/CILs. Their belief appears to be justified, in that these HA activities did not seem to be authoritatively in-line as part of a true hazard control and risk management process. It appears they were carried out in a relatively sterile environment outside the mainstream of engineering.

The safety personnel did use the HAs along with the FMEA/CILs to create Mission Safety Assessments for the major elements of the STS and for the overall missions. These MSAs were to provide "a formal, comprehensive safety report on the final design of a system." However, in practice, the MSA reports essentially served as process assurance reports. They listed the hazards and stated whether they were eliminated or controlled; compared hardware parameters with safety specifications; specified precautions, procedures, training, or other safety requirements; and generally documented compliance with the various reliability and safety tasks. They did not provide in-depth quantitative risk assessments, and relied almost exclusively on the CILs and HA reports for justification of acceptable risks. New design changes and/or flight data were "examined" and "judged" for safety by various personnel and boards at NASA Levels III, II, and I; the vehicles for the approval of changes appear to have been the FRRs and various special reviews. The HA and MSA reports were not viewed as controlling documents on a specific system configuration which was judged to be safe by the safety organizations. The initial waivers to fly Criticality 1 and 1R items were not always redone in a timely way after new data were obtained.
Thus, our audit supports the impression that the hazard analysis is not used to its fullest advantage and that overall system safety assessments, based on test and flight data and on quantitative analyses, are not a part of the process of accepting critical failure modes and hazards.

Since the Hazard Report does not provide a comprehensive risk assessment, or even an independent evaluation of the retention rationale stated in the input CILs, we believe the overall process shown in Figure 5-2, representing NASA's current plans, has serious shortcomings. The isolation of the hazard analysis within NASA's risk assessment and management process to date can be seen as reflecting the past weakness of the entire safety organization. For that reason, this issue of the role of hazard analysis drives to the heart of our most sweeping conclusion, which is that the information flow, task descriptions, and functional responsibilities implied by Figure 5-2 must be modified if NASA is to achieve a truly effective risk management process. The reordering of functions which the Committee recommends is described in detail later in this chapter.

. . . currently too long, but a gradual reduction in flow time is expected to occur.

Recommendations (9c):

The Committee recommends that NASA maintain its current intense attention toward reducing cannibalization of parts to an acceptable level. We further recommend that adequate funds for the procurement and repair of spare parts be made available by NASA to ensure that cannibalization is a rare requirement. Finally, we recommend that NASA include cannibalization, with its attendant removal and replacement operations, as a potential producer of failure in the integrated risk assessment recommended earlier (Section 5.1).

5.10. OTHER WEAKNESSES IN RISK ASSESSMENT AND MANAGEMENT

5.10.1 The Apparent Reliance on Boards and Panels for Decision Making

The multilayered system of boards and panels in every aspect of the STS may lead individuals to defer to the anonymity of the process and not focus closely enough on their individual responsibilities in the decision chain. The sheer number of STS-related boards and panels seems to produce a mindset of "collective responsibility."

The NSTS Program is a large organization whose mission involves the development, deployment, and operation of a complex space vehicle in a wide range of missions. Associated with each milestone in the development of any NASA space system and its constituent parts, or in the preparation for a space mission, are one or more reviews. These reviews may be made from the standpoint of requirements, engineering design, development status, safety, flight readiness, or resource requirements. Conducting each review is a team, panel, or board, which may or may not be permanently empaneled. As described in Section 3.2.2, in the NSTS Program there are review groups at every level of management, including the contractor organizations.

Figure 5-11 depicts the review groups associated with the NSTS FMEA/CIL and hazard analysis processes alone. There are also boards to review design requirements and certification, software, the Operations and Maintenance Requirements and Specifications Document (OMRSD) and the Operations and Maintenance Instructions (OMI), the Launch Commit Criteria, and mission rules. There are flight readiness reviews at each stage of preparation, with a Launch System Evaluation Advisory Team to assess launch conditions and a Mission Management Team to oversee the actual mission.

The Committee developed a concern about a possible attitudinal problem regarding the decision process on the part of the NASA personnel engaged in it. Given the pervasive reliance on teams and boards to consider the key questions affecting safety, "group democracy" can easily prevail, with the result that individual responsibility is diluted and obscured. Even though presumably the chairman of each group has official responsibility for the decision, most decisions appear to be highly participatory in nature. In a CCB review audited by the Committee, for example, there were 25-35 people present and the role of the chairman was not especially distinct. Each action appeared to be a consensus action by the board. It is possible that this is a factor in the problem identified by the Rogers Commission: ". . . a NASA management structure that permitted internal flight safety problems to bypass key Shuttle managers" (Vol. I, p. 82). For example, the Level II PRCB conducts daily and weekly meetings, usually via teleconference, in which as many as 30 people participate.
It is certainly conceivable that individuals might be reluctant to express their views or objections fully under such circumstances. Also, passing decisions upward through the ranks of review boards may reduce each chairman's sense that his decisions are crucial. As a case in point, it is clear from the report of the Rogers Commission, and from statements made to the Committee by NASA personnel involved, that the lines of authority and responsibility in the flight readiness review decision-making chain had become vague by the time of mission 51-L.

In discussing this issue, NASA's Associate Administrator for SRM&QA pointed to the SR&QA directors at the field centers as the individuals with primary responsibility for the safety of the Shuttle system. They are said to have full "responsibility, authority, and accountability."

[Figure 5-11: review boards and panels associated with the NSTS FMEA/CIL and hazard analysis processes. Figure not reproduced in this text.]

Nevertheless, these individuals do make inputs to larger and higher boards, so that in the end all decisions become collective ones, lacking the crucial mindset of individual accountability. It is possible that a semantic problem is partly at fault here, in that NASA managers often refer to "the board" as being synonymous with its chairman, with respect to decision authority. Nevertheless, a mindset is thereby established in which it is not clear whether these are individual or group decisions. The Committee contrasted the NSTS system with that of the U.S. Air Force, in which the board (including its chairman) makes recommendations to the decision maker. One positive point in favor of NASA's system is that, there, the chairman (who is the decision maker) is required to listen "in public" to all dissenting views.

The Committee recognizes the important role played by the many panels and boards in the NSTS program in providing coordination, resolving problems and technical conflicts, and reviewing and recommending actions. These entities allow the different interests and skill groups to bring forward their inputs, contribute their knowledge, and thus minimize the risk that a proposed action will negatively affect some aspect of the STS.

Recommendation (10a):

The Committee recommends that the Administrator of NASA periodically remind all NASA personnel that boards and panels are advisory in nature. He should specify the individuals in NASA, by name and position, who are responsible for making final decisions while considering the advice of each panel and board. NASA management should also see to it that each individual involved in the NSTS Program is completely aware of his/her responsibilities and authority for decision making.

5.10.2 Adequacy of Orbiter Structural Safety Margins

The primary structure of the STS has been excluded, by definition, from the FMEA/CIL process, based on the belief that there is an adequate positive margin of safety. However, the Committee questions whether operating structural safety margins have actually been proven adequate. Completion of the Model 6.0 loads study and the reevaluation of margins of safety based on these loads will significantly improve NASA's grasp of actual operating margins of safety.

NASA ground rules exclude primary structure from the FMEA/CIL process. NASA has apparently assumed that the structural reliability of the STS (including the Orbiter, External Tank, and Solid Rocket Boosters) is close to 1.00, because the operating loads are believed to be less than the proof load to which the vehicle has been subjected. It is true that some structures have reliability approaching 1.00; examples include bridges, buildings, and even commercial airliners. But there is a considerable difference between the Shuttle, a first-of-its-kind vehicle operated under unique conditions and challenging environments, and a commercial airliner, which is designed and tested to loads and conditions that are well understood. In addition, in the case of a commercial airliner the certifying agency (FAA) and operator organizations act as independent rule makers and auditors. No such independent check and balance exists for the STS, where NASA controls all functions in-house (including requirements, analysis methods, testing, and certification), primarily within the NSTS program.

The original development plans for the Orbiter (the most complex and vulnerable element, and the only manned element) included a conventional structural test program for certification of the structural integrity.
A complete, full-scale structural test article (an Orbiter vehicle) was to be included which was to be loaded to 1.4 times the operating limit load in the most critical conditions. (This compares to the conventional value of 1.5 used by the military and the FAA.) Due to budget problems NASA decided to eliminate one of the planned flight vehicles and convert the static test article (#099, Challenger) to a flight vehicle after a series of proof tests to only 1.20 times the limit load. Some loading conditions actually did not exceed 1.15 times the limit load. Therefore, the tests did not even verify a 1.4 strength margin over limit loads. Subsequent flight test data and calculations show that in some areas the maximum operating loads are actually 15% to 20% higher than those originally postulated, so that the static proof loading tests demonstrated only approximate limit conditions.
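A rough illustration of the arithmetic behind this conclusion (the calculation below is only an illustration built from the figures above, not a calculation taken from NASA or Committee documents) is:

\[
\frac{\text{demonstrated proof load}}{\text{actual limit load}}
\;\approx\;
\frac{(1.15\ \text{to}\ 1.20)\,L_{\mathrm{orig}}}{(1.15\ \text{to}\ 1.20)\,L_{\mathrm{orig}}}
\;\approx\; 0.96\ \text{to}\ 1.04,
\]

where \(L_{\mathrm{orig}}\) is the originally postulated limit load. In other words, the proof tests demonstrated strength only to roughly the limit loads now expected in flight, far short of the 1.4 factor the certification program originally intended to verify.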

Thus, today there is no demonstrated verification of safety margins for critical elements of the Orbiter. The model of loads and stresses on the Orbiter used in its original design has been revised once. By 1983 even these data had become suspect, and another complete revision of loads using the latest test and analysis data was begun. Calculated strength margins from this study (called Model 6.0) are expected to be available by November 1987.

The Committee believes that the margin of actual strength over maximum expected limit load for critical areas of the Orbiter structure is not well known. Partly this is because loading conditions are complex and unprecedented, and partly it is because very little (if any) of the flight structure was actually tested to failure. The Committee agrees with the decision not to use the FMEA/CIL process on STS structures. However, we remain concerned about the uncertainty in the actual strength margins of safety. The Model 6.0 loads calculation now nearing completion should correct the known discrepancies in external loads. Verification of the Model 6.0 loads by data routinely gathered from an instrumented and calibrated flight vehicle, beginning with the next flight, can help verify the model and establish the margins of safety more definitively. This knowledge will greatly improve NASA's ability to keep Shuttle operations within a safe envelope of structural loads.

Implicit in the safe operation of any such structure is a monitoring system to assure that deterioration of structural integrity does not occur. An effort now underway could add materially to NASA's ability to operate the Orbiter's structure safely over its service life. People with airline experience, working under Rockwell International, are developing a maintenance and inspection plan for the structure. A well-planned periodic inspection of this sort is essential, and is the best preventive for unpleasant occurrences due to structural deterioration or other causes.

Recommendations (10b):

The Committee recommends that NASA place a high priority on completion of the Model 6.0 loads, the reevaluation of safety margins for these loads, and the early verification and continued monitoring of the Model 6.0 loads by permanently instrumenting and calibrating at least the next full-scale STS vehicle to fly. We further recommend that NASA complete and implement a comprehensive plan for conducting periodic inspection and maintenance of the structure of the Orbiters throughout the service life of each vehicle.

5.10.3 Software Issues

NASA FMEAs do not assess software as a possible cause of failure modes. There is little involvement of JSC Safety, Reliability and Quality Assurance in software reviews, resulting in little independent quality assurance for software. A large amount of data, much of it flight specific, must be loaded for each Shuttle mission, but it is not subjected to validation as rigorous as that for the software.

The Shuttle onboard data processing system consists of five general purpose computers (GPCs) with their input and output devices, and memory units. Four of the five GPCs contain the primary software system, known as the Primary Avionics System Software (PASS); the fifth is a redundant computer which contains the Backup Flight System (BFS). The PASS is developed by IBM, and the BFS is built by Rockwell. In addition to flight software code, there are also flight software initialization data, called "I-loads," which are mission-unique parameter values.
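As noted in the summary above, this mission-unique data is not validated with the rigor applied to the flight code itself. A minimal sketch of the kind of automated range-and-consistency checking such data could receive is shown below; the parameter names, limits, and structure are hypothetical and are not NASA's actual I-load definitions.

```python
# Illustrative range-and-consistency check for mission-unique parameters
# ("I-loads"). Parameter names, limits, and units are hypothetical and do
# not reflect actual NSTS I-load definitions.

ILOAD_LIMITS = {
    # name: (minimum, maximum, units)
    "max_dynamic_pressure": (0.0, 820.0, "psf"),
    "srb_separation_time": (100.0, 140.0, "s"),
    "target_orbit_inclination": (28.0, 62.0, "deg"),
}

def validate_iloads(iloads):
    """Return a list of discrepancy messages; an empty list means the load passes."""
    problems = []
    for name, (lo, hi, units) in ILOAD_LIMITS.items():
        if name not in iloads:
            problems.append(f"missing parameter: {name}")
            continue
        value = iloads[name]
        if not (lo <= value <= hi):
            problems.append(
                f"{name} = {value} {units} is outside the allowed range [{lo}, {hi}]"
            )
    # Flag parameters present in the mission load but absent from the controlled list.
    for name in iloads:
        if name not in ILOAD_LIMITS:
            problems.append(f"unrecognized parameter: {name}")
    return problems

# Example mission load with one out-of-range value and one unlisted parameter.
mission_load = {
    "max_dynamic_pressure": 780.0,
    "srb_separation_time": 155.0,        # out of range
    "target_orbit_inclination": 28.45,
    "unlisted_parameter": 1.0,           # not in the controlled list
}
for message in validate_iloads(mission_load):
    print(message)
```

Checks of this sort do not substitute for independent validation, but they illustrate how mission-specific data could be screened against controlled limits before delivery.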
The basic code is reconfigured for specific missions, with about two such "reconfigured flight loads" per flight. After the software requirements are approved, three levels of development tests are performed leading to the First Article Configuration Inspection, or FACI. At the FACI milestone, the software package is handed off to the contractor's verification organization for independent testing, called Independent Validation and Verification (IV&V), which leads to the Configuration Inspection (CI) and delivery to NASA. (The degree of independence of the IV&V was discussed in Section 5.8.) Following mission-specific reconfiguration and testing in the SAIL and other JSC laboratories, the package is ready for Flight Readiness Review. A Shuttle Avionics System Control Board (SASCB) is the Level II flight software control board, to which the Program Requirements Control Board has delegated responsibility for software configuration control.

The Manager of the NSTS Engineering Integration Office chairs this board and signs the flight readiness statement on software; thus he is the focus of configuration control and management authority for software. At Level III there is a Software Control Board, corresponding to the Configuration Control Board for hardware issues.

The testing, control, and performance of STS software seem quite good. Out of some half-million lines of code in the Shuttle flight software, typically an average of one error is discovered beyond the CI. With the emphasis placed on early detection of errors, error rates are quite low throughout the total 10 million-line Shuttle software system. Only once has a software problem disrupted a mission (on STS-7, uncertainty about the effect of installed software code on a particular abort scenario caused a launch scrub). Both the developers and the "independent" certifiers perform their own inspections of the code. Special "code audits" are also carried out to reinspect targeted aspects of the code on a one-time basis, based on criticality, complexity, Discrepancy Reports (DRs), and other considerations. Software quality control includes weekly tracking of DRs through the Configuration Management database (which tracks all faults, their causes and effects, and their disposition); trends of DRs are reported quarterly.

Although generally impressed with the Shuttle software development and testing process, the Committee made a number of specific findings. First, we note that software is not a FMEA/CIL item. NASA personnel state that all software is considered to be Criticality 1, with each problem being fixed as soon as it is detected through testing and simulation. The Committee believes that identification and prediction of software faults or error modes may be feasible by dividing the software into functional modules and then considering the various possible failures (e.g., improper constants, discretes, or algorithms; missing or superfluous symbols). There is little involvement of the JSC SR&QA organization in software reviews, due to the limitations on staff. As a result, there is little independent quality assurance for software. Finally, we note that a large amount of data, much of it flight specific, must be loaded for each Shuttle mission. However, the data and its entry are not validated with the same rigor as in the IV&V of the software.

Recommendations (10c):

The Committee recommends that NASA: explore the feasibility of performing FMEAs on software, including the efficacy of identifying and predicting fault and error modes; request JSC SR&QA to provide periodic review and oversight of software from a quality assurance point of view; and provide for validation of input data in a manner similar to software validation and verification.

5.10.4 Differences in Procedures Among NASA Centers

Differences in the procedures being used by the main NASA centers involved in the NSTS Program may reflect an imbalance between the authority of the centers and that of the NSTS Program Office. The Committee is concerned that such an imbalance can lead to serious problems in large programs where two or more centers have major roles in what must be a tightly integrated program, such as the NSTS and Space Station. Without strong, central program direction and integration, the success and safety of these complex programs can be placed in jeopardy.

In March 1986, the NASA Associate Administrator for Space Flight and the Manager of the Level II NSTS Program issued memoranda setting forth NASA's strategy for returning the Space Shuttle safely to flight status.
Their orders rescinded all Criticality 1, 1R, and 1S waivers and required that they be resubmitted for approval. The process also required the reevaluation of all FMEA/CILs and retention rationales, as well as hazard analyses. Other instructions required that a contractor be selected for each STS element (that contractor not otherwise being involved in work on the element) to conduct an independent FMEA/CIL. No specific guidelines were issued by the NSTS Office for the conduct of the independent evaluations; the methods to be used were determined by the NASA centers concerned. Also, the FMEA/CIL reevaluations were initiated using pre-51-L FMEA/CIL instructions, in which there were differences in ground rules between JSC and MSFC. (In October 1986, the NSTS Program Office issued new uniform instructions, NSTS 22206, for the preparation of FMEA/CILs, but it took several months for revised directions to reach the STS contractors.) Thus, some differences emerged in the nature and results of the reevaluation conducted by different contractors.

These differences are especially noticeable with respect to the FMEA/CIL reevaluation procedures. The Committee found that, at MSFC, all contractors had been instructed to conduct a new FMEA, "from scratch." At JSC, the independent contractors were told to prepare a new FMEA, but the prime contractors were instructed to reevaluate the existing FMEA. At KSC, where FMEAs are conducted only on ground support equipment, a single group (not the original designer) was reevaluating each category of FMEA, working with the existing FMEA. Procedures with respect to the independent reviews also differed. At MSFC, the independent contractor first performed its FMEA and developed any necessary retention rationales; it then compared those results with the FMEAs and retention rationales prepared by the prime contractor and wrote specific Review Item Discrepancies (RIDs) on points of difference or disagreement. At JSC, no RIDs were written and no retention rationales were prepared by the independent contractor. Furthermore, some Orbiter subsystems were initially excluded from the review.

Initially, the Committee was concerned that these differences in procedure might reduce the validity and effectiveness of the FMEA/CIL reevaluation process. However, an audit by the Committee of the documentation and review process used by JSC in the case of the Orbiter indicated that it is a reasonable alternative to the RID process employed by MSFC. Nevertheless, the Committee suggested in its second interim report to NASA (see Appendix C) that the NSTS Program Office "review the FMEA/CIL reevaluation processes as implemented for each STS element to assure itself that any differences will not compromise the quality and completeness of the overall STS FMEA/CIL effort."

This more specific concern for procedural differences led, moreover, to a broader concern over the nature of management control within NASA. Differences in procedures used by the NASA centers in this context and others (e.g., with respect to the independence of STS certification, as discussed in Section 5.8) lead the Committee to suspect that an imbalance may exist between the authority of the centers and that of the NSTS Program Office. The Committee is concerned that such an imbalance can lead to serious problems in large programs where two or more centers have major roles in what must be a tightly integrated program, such as the NSTS and Space Station. Without strong, central program direction and integration, the success and safety of these complex programs can be placed in jeopardy.

Recommendation (10d):

The Administrator should ensure that strong, central program direction and integration of all aspects of the STS are maintained via the NSTS Program Office.

5.10.5 Use of Non-Destructive Evaluation Techniques

Non-destructive evaluation (NDE) tests on the Solid Rocket Motor (SRM) are performed at the manufacturing plant. Subsequent transportation and assembly introduce a risk of debonding and other damage which may not be apparent upon visual inspection. No NDE is done on the SRMs in the "stacked" configuration at the launch facility. New NDE techniques now being developed have potential applicability to the STS.

Problems have been detected by NASA and its contractor on the STS Solid Rocket Motor (SRM) with debonding between the propellant, liner, insulation, and case. In April 1986, a USAF Titan 34D (comparable in design to the SRM) experienced a destructive failure shortly after launch, due to debonding.
No such severe consequences have been seen from SRM debonding, but bond line problems are nevertheless viewed as critical failure modes, especially given the redesign of the SRM joints. Voids within the propellant mass are also of concern. Destructive inspection of the SRM (e.g., cutting and probing) is not feasible, so non-destructive methods must be used. On the SRM, most of these tests are performed at the manufacturing plant; later transportation and assembly introduce a risk of debonding and other damage which may be more difficult to detect at the launch site.

There are essentially two issues here: the techniques employed and the location where inspection is done. Shuttle SRM NDE assessment to date has employed a combination of visual, ultrasonic, and radiographic techniques. The range of NDE techniques considered by NASA (but not necessarily tested) as of January 1987 is shown in Table 5-1. According to NASA's Aerospace Safety Advisory Panel, acoustic and thermographic techniques are

thought to be those with the greatest near-term potential for improving NDE capabilities with respect to the SRM.[12] Another promising group of techniques is based on X-ray technology. The USAF, in its Titan recovery program, has emphasized NDE techniques including ultrasonic, thermographic, and X-ray.[13] Similar efforts are being pursued in the Navy's Trident program.[14]

TABLE 5-1 Non-Destructive Evaluation Methods Considered by NASA

Method                 | Looks For                                                                  | Remarks
Ultrasonics            | Unbonds: case/insulation, inhibitor/propellant, and propellant/liner      | Propellant/liner to be confirmed
Radial radiography     | Propellant voids/inclusions                                                |
Tangential radiography | Gapped unbonds: propellant/liner, flap bonds, and flap bulb configuration  |
Thermography           | Unbonds: case/insulation, inhibitor/propellant, and propellant/liner      | Limited experience base; propellant/liner to be confirmed
Mechanical             | Unbonds: near joint end, case/insulation                                   | Complex insulation geometry
Oblique-light video    | Gapped edge unbonds: case/insulation and inhibitor/propellant              | Magnifies and automates visual unbond inspection
Computed tomography    | Gapped unbonds: all intersecting interfaces, propellant voids/inclusions   | Long term
Holography             | Unbonds: near joint end, case/insulation                                   | Excitation and scale concerns
Acoustic emission      | Unbonds: case/insulation                                                   | Long term
(Source: NASA MSFC)

With respect to the issue of location, NASA has determined that the "stacked" configuration of the SRM is not amenable to NDE of critical areas using available methods. However, NASA engineers believe that the assembly, rollout, and pad hold-down loads on the SRM will not cause debonding. Therefore, inspections are conducted at key processing points in the plant and at critical SRM segment locations before stacking at Kennedy Space Center. Nevertheless, the Committee remains concerned about the possibility of damage resulting from transportation, assembly, and rollout. We recognize that NASA is (and has been) paying serious attention to the NDE issue. However, we believe that the technologies are developing rapidly enough that continued close attention is warranted.

Recommendation (10e): The Committee recommends that NASA apply all practicable NDE techniques to the SRM at the launch facility, at the highest possible level of assembly (e.g., SRMs in the "stacked" configuration), and emphasize development of improved NDE methods.

[12] NASA Aerospace Safety Advisory Panel, Annual Report for 1986 (February 1987).
[13] Lt. Col. Frank Gayer, USAF Space Division, personal communication.
[14] Dale Kenemuth, SP-273, Dept. of the Navy, personal communication.

5.11 FOCUS ON RISK MANAGEMENT

The current safety assessment processes used by NASA do not establish objectively the levels of the various risks associated with the failure modes and hazards. It is not reasonable to expect that NASA management or its panels and boards can provide their own detailed assessments of the risks associated with failure modes and hazards presented to them for acceptance. Validation and certification test programs are not planned or evaluated as quantitative inputs to safety risk assessments. Neither are operating conditions and environmental constraints which may control the safety risks adequately defined and evaluated.

In the Committee's view, the lack of objective, measurable assessments in the above areas hinders the implementation of an effective risk management program, including the reduction or elimination of risks.
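To make concrete what an objectively established risk level might look like, the short sketch below combines per-mission probability estimates for individual failure modes with consequence severities and compares the result against an acceptance criterion. It is a minimal illustration only: the failure modes, probability values, severity weights, and threshold are hypothetical and are not drawn from NASA analyses or from this report.

```python
# Minimal illustrative sketch of an objective, severity-weighted risk-level
# calculation. All failure modes, probabilities, severities, and thresholds
# below are hypothetical; they are not NASA data.

from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    p_per_mission: float  # estimated probability of occurrence per mission
    severity: float       # consequence weight, e.g., 1 (minor) .. 4 (loss of vehicle/crew)

ACCEPTABLE_RISK = 1.0e-3  # hypothetical severity-weighted acceptance criterion

failure_modes = [
    FailureMode("case/insulation debond", 2.0e-4, 4.0),
    FailureMode("propellant void",        5.0e-5, 3.0),
    FailureMode("field joint seal leak",  1.0e-4, 4.0),
]

for fm in failure_modes:
    risk = fm.p_per_mission * fm.severity  # severity-weighted risk level
    verdict = "ACCEPT" if risk <= ACCEPTABLE_RISK else "REQUIRES REDUCTION"
    print(f"{fm.name:28s} risk = {risk:.1e} -> {verdict}")
```

In any real application, each probability estimate would have to be traceable to test, flight, and inspection data; producing and defending those numbers is the professional staff work whose absence the Committee describes below.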
Throughout its audit the Committee was shown an extensive amount of information related to program flow charts, organizations, review panels and boards, information transmission, and reports. But the Committee did not become aware of an organization and safety-engineering methodology that could effectively provide an objective assessment of risk, as described in Section 4. Throughout the flow of NASA reports and approvals, both

before the 51-L mission and after, judgments are made and statements of assurance given by persons at every level which are based on data and assertions having a wide range of validity. The Committee believes that it is not reasonable to expect program management or NASA Level I management to provide its own in-depth evaluation of presented hazard risks. Nor will other panels or boards be able to do so without the necessary professional staff work being done. That work, in turn, cannot be performed without methods for assessing risk and controlling hazards. The methods must include the establishment of criteria for design margins which are consistent with the acceptable levels of risk.

The Associate Administrator for SRM&QA, in his new plan for management of NASA's SR&QA activities, stipulates that the SR&QA directors of the NASA centers are responsible for assuring the safety of their Center's products and services. However, we conclude that unless the safety organizations at the centers have (1) the appropriate methodology and tools (both analysis programs and personnel), and (2) the authority to establish criteria for safety margins, specific requirements on verification test programs, environmental constraints on operations, and total flight configuration validation, they cannot be held responsible for assuring an acceptable level of safety of flight systems. (In fact, they can never "assure safety," but only assure that the risks have been assessed objectively by approved methodologies, and that they are being controlled to the levels accepted by the appropriate NASA authorities.)

Figure 5-12 shows that even in the current post-51-L planning, the final result of the hazard analysis and safety assessment process is a NASA Space Shuttle Hazards Data Base. Having an approved list of accepted, identified hazards and a sophisticated closed-loop accounting and review system (the SNAP) may be useful. However, nearly every catastrophic accident since the beginning of the missile and space programs was caused by some already-identified hazard related to potential failure modes. The essence of safety-risk management, in the Committee's view, is not just the identification and acceptance of potential hazards, nor even the performance of a risk assessment for each failure mode and hazard; it is getting control of the conditions which turn potential into real. The FMEAs, CILs, hazard reports, and safety assessments identify risks, summarize information, reference data, provide status, etc. They do not analyze or establish the risk levels. Neither do they assess quantitatively the validity of the test programs in establishing failure margins, or define the operating conditions or environmental constraints which affect the risk levels.

We believe that the key requirements and concepts contained in various relevant NASA documents (see Section 3, for example) provide a good overall framework within which a comprehensive systems safety and risk management program could be defined and implemented. It is the opinion of the Committee that such a program would require bringing together appropriate activities into a focused "Systems Safety Engineering" (SSE) function at both Headquarters and the centers. This SSE function would apply across the entire set of design, development, qualification and certification, and operations activities of the NSTS. These activities would be an integral engineering element of the NSTS Program.
They would involve more than just the preparation of reviews, reports, or data packages. Instead, systems safety engineering would combine the functions of reliability and systems safety analysis. It should be responsible for defining the requirements and procedures, and performing or managing, as appropriate, at least the following functions which comprise the basis of a risk assessment and risk management system:

1. Identification of failure modes and effects
2. Establishment of design criteria for redundancy
3. Identification of hazards and their potential consequences
4. Identification of critical items
5. Evaluation of the probability of occurrence of causes and consequences of failure modes and hazards
6. Establishment of safety-risk level criteria for design margins and hazard controls
7. Design of qualification and certification test programs
8. Objective assessment of safety risks
9. Development of acceptance rationale for retained hazards and hazard reports
10. Specification of environmental and operating constraints at all levels (parts, subsystem, element, and system) to assure that validated margins are not violated

11. Quantitative evaluation of flight data to update safety margin validations
12. Oversight of quality assurance functions to control safety risks
13. Overall system safety risk assessment and definition of the potential to reduce the level of risk.

FIGURE 5-12 NASA NSTS safety analysis, Hazard Reports, and safety assessment process in 1987 (NASA JSC SR&QA). [Flow chart not reproduced; asterisks in the original mark procedures added since 51-L.]

All of the above systems safety engineering functions (elaborated upon in Appendix F) are necessary both for achieving credible risk assessment and for

defining the risk controls required to justify acceptance of critical failure modes and other hazards. During design and development, the quantitative evaluation of relative risks for each design against acceptable criteria for levels of risk should be considered as an integral part of the systems engineering activity. These activities also would provide a definitive basis for establishing the design margins and operational constraints needed to reduce the overall risk to the accepted level and subsequently control the risk.

Function 13 above (definition of the potential to reduce the level of risk) is an essential input to risk management. The Committee has the impression that changes to the STS often are considered only if they will improve its performance or reduce risks to that level which has previously been accepted in the program. The Committee believes that such risks, accepted in the past, logical as that may have appeared to be at the time, should not continue to be accepted without a concentrated effort to plan and implement a program to remove or reduce these risks.

The magnitude of the preceding tasks points to the need for a large number of highly qualified professional systems safety engineers (i.e., systems engineers with a safety orientation) at NASA and at its major contractors. We were disturbed to learn from the Director of the Safety Division at Headquarters SRM&QA that, as of April 25, 1987, he had only one professional systems safety engineer in his division, and that he expects to add only two more in the near term and four additional ones in the long term. It is troubling to the Committee that this important and extremely complex systems engineering function should be so severely constrained by staff limitations, in light of the cost of the Shuttle and the risk to its crew.

Taken together, the tasks listed above have the highest leverage on overall risk assessment and the control of the causes of hazard. Only professionally dedicated systems safety engineers working together can develop the expertise and motivation to carry out these functions properly. They can perform their control of validation and certification programs in an objective way (if not functionally assigned to program organizations). The need for independent entities to perform certification and software IV&V to provide substantiation and confidence was discussed in Section 5.8. This risk-managed approach to the validation and certification functions, including the feedback of flight data, should not be done by those responsible for design and development. They are performance oriented; they generally do not design hardware configurations to facilitate margin validation, and their proposed certification programs usually are not oriented to the demonstration of failure margins.

Finally, it seems to the Committee that it is not managerially reasonable to make an organization responsible for holding system safety to an agreed level of risk without according it responsibility and authority over all of the above functions, which actually control the risks.

Another major element of an overall risk management program is the quality assurance (QA) function. Quality assurance certifies that the hardware and software have been produced to the exact designs which describe the validated and qualified system.
The "configuration" includes all aspects of the hardware and software, including the environments which in any way influence the properties of materials, stress margins, or temporal behavior of parts, subsystems, and elements.

In 1986, responsibility for policy and oversight of the quality assurance function was assigned to the new office of the Associate Administrator for SRM&QA. This is appropriate, because overall risk management and total systems safety are dependent on the quality assurance function throughout NASA. The QA function should be performed separately from the systems safety engineering functions (although there is certainly a strong oversight interaction between the two). Quality assurance should be a responsibility of each NASA center (and, of course, each contractor). Its purpose is not to design but to control and assure. As part of this function it should control the entire set of final released engineering documents describing the complete configuration of the system. As the Committee understands it, that is precisely NASA's current practice.

Recommendations (11): The Committee recommends that NASA consider establishing a focused agency-wide Systems Safety Engineering (SSE) function, at both Headquarters and the centers, which would: be structured so as to be integrally involved in the entire set of design, development, validation, qualification, and certification activities; provide a full systems approach to the continuous

identification of safety risks (not just failure modes and hazards) and the objective (quantitative) evaluation of such safety risks; provide the output of this function to the NASA Program Directors in support of their risk management; and support the Program Directors by providing assurance that their systems are ready for final safety certification to the risk levels established by the NASA Administrator.

The Committee also recommends that the STS risk management program, based in part on the definition of the potential to reduce the level of risk developed by the system safety risk assessment, include a concerted effort to remove or reduce the risks.
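As one way to picture how the "potential to reduce the level of risk" could be turned into an actionable ranking, the sketch below orders retained hazards by the estimated risk reduction obtainable per unit of corrective-action cost, so that removal or mitigation effort can be prioritized. The hazard names, risk levels, and cost figures are hypothetical and serve only to show the bookkeeping, not any actual NASA assessment.

```python
# Illustrative ranking of retained hazards by achievable risk reduction per
# unit of corrective-action cost. Hazard names, risk levels, and costs are
# hypothetical and do not come from NASA data.

hazards = [
    # (hazard, current risk level, estimated risk after corrective action, relative cost)
    ("SRM bond-line debond after transport", 8.0e-4, 2.0e-4, 5.0),
    ("Orbiter tile damage from debris",      6.0e-4, 3.0e-4, 2.0),
    ("APU hydrazine leak",                   3.0e-4, 1.0e-4, 1.5),
]

def leverage(entry):
    """Risk reduction obtained per unit of corrective-action cost."""
    _name, current, after, cost = entry
    return (current - after) / cost

# Highest-leverage corrective actions first
for name, current, after, cost in sorted(hazards, key=leverage, reverse=True):
    print(f"{name:40s} {current:.1e} -> {after:.1e}  leverage {(current - after) / cost:.2e}")
```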