Copyright © 2009. National Academy of Sciences. All rights reserved.
APPENDIX B
LETTER REPORT TO GWENDOLYN S. KING
June 15, 1990
The Honorable Gwendolyn S. King
Commissioner
Social Security Administration
Department of Health and Human Services
6401 Security Boulevard
Baltimore, Maryland 21235
Dear Commissioner King:
The Department of Health and Human Services has asked the National Research
Council to conduct a two-phase review of the Social Security Administration's (SSA)
information systems modernization and agency strategic plans. Our committee's review
began in September 1988, and thus far we have met 12 times, including two workshops.
On April 3, 1989, we issued a letter report that responded to former Commissioner Dorcas
R. Hardy's request for accelerated advice regarding the agency's progress toward systems
modernization. In February 1990, we issued a full report on the first phase of our study
entitled "Systems Modernization and the Strategic Plans of the Social Security
Administration." We are planning to issue a full report on the second phase of our study
in November 1990.
This letter report deals with the subject of backup and recovery from a disaster at
SSA's National Computer Center (NCC). A disaster could so seriously damage the NCC
that its major operations would have to be established elsewhere. SSA's current disaster
recovery plans provide for restoration of tape-based batch processing only, at a commercial
"hot site" sized to handle about one-third of that workload. This present arrangement will
not provide on-line support for any of the agency's functions, and we are convinced that
it will be impossible to revert to manual paper-based systems should the NCC be lost.
Therefore, we believe that the present hot site arrangement is an unacceptable choice for
backup and recovery unless it is supplemented with communications to field sites.
We have previously reported on this issue. In our letter report of April 3, 1989, we
stated that a "loss of the NCC for any reason would significantly reduce the agency's ability
to serve the public." In our phase 1 report, we devoted a section to the subject of
"Continuity of Service" and expanded on our concern that the backup and recovery plans
for the NCC are inadequate. We stated: "Since the NCC has become such a critical
element in the SSA's ability to serve its clients, we recommend that: the Social Security
Administration immediately develop a workable strategy for surviving a partial or major
loss at the National Computer Center." During our review, we found that this concern was
shared by SSA's technical managers and senior executives.
In February 1990, at the agency's request, our committee's charge was extended by
contract modification to include a review and letter report on the approaches SSA might
take in planning and selecting a workable disaster recovery strategy. We began this new
task by forming a subcommittee to meet separately with agency technical experts and
managers, to review their plans, and to gather the background information needed for our
review. This letter report summarizes the major issues, suggests strategic goals, identifies
the primary alternatives, advises on the steps to take in planning for disaster recovery, and
recommends that: SSA build a second data center, much smaller than the NCC, to share
some of the processing load and provide for limited recovery of operations at either site.
All Functions vs. Critical Functions
The Social Security Administration should limit its disaster recovery strategy to
a chosen set of critical functions rather than planning to back up all of its
processing functions, because full backup is impractical.
SSA managers have based SSA's disaster recovery plans on the assumption that it
will continue to perform virtually all of its functions, albeit more slowly. The SSA's
Disaster Recovery Plan - NCC Critical Operations (undated), which we reviewed, states:
"The following operations have been designated as critical . . . Post-Entitlement, Claims,
SSI, Enumeration, Earnings Record Maintenance, and Black Lung."
Following the initial meeting of our subcommittee, we asked SSA managers to
develop a limited list of functions it must continue to fulfill even if the NCC were lost.
Thus far, SSA managers have not been able to decide which functions are critical or,
conversely, which functions they will curtail or suspend following a loss of the NCC. For
example, a draft SSA white paper (dated 12/15/89) provided for our review lists four
options for backup and recovery, but each one specifies performing 100 percent of the
programmatic workload.
At our urging, SSA made a preliminary effort to identify a reduced list of workload
priorities and deferrable workloads, but the results presented to us were offered as tentative
and not definitive of the agency's plans.
Selecting functions that SSA will suspend or curtail during an emergency runs against
the agency's culture, which is rooted in its public service mission. It can also provoke turf
battles within the agency over which functions are more important than others. Also, such
selections are rife with political implications that defy logical determination and usually
change with the priorities of the administration. Given the internal and external politics
and the technical impediments of the current software, a strategy to run everything, but
more slowly, may seem to be a good compromise because it avoids hard decisions, but it
is costly because it requires greater processing capacity.
Thus far, SSA has not been able to select a critical subset of its workload that can
be processed quickly on a modest hardware platform. While we were not able to
determine the exact reason for this difficulty, the intertwined nature of SSA's software
processing modules, and the complexity of the program laws themselves, appear on the
surface to be responsible. However, such factors are faced by many organizations with
integrated software and do not relieve management's obligation to make difficult choices.
Furthermore, we are opposed to SSA's developing a customized system for disaster
conditions or redesigning its present systems with the exclusive goal of allowing them to be
partitioned in an emergency. A customized system will be too difficult to keep up-to-date,
and a full redesign should serve operational objectives as well.
In a rare emergency situation, SSA's clients can be expected to be understanding and
tolerant of delays for most of the agency's services; however, we believe that it is critical
for the checks to go out on time and for major changes affecting payments to be processed
(e.g., starting and stopping them), even if accuracy suffers. Most other routine interactions
between SSA and its clients may be justifiably deferred during an emergency.
During this study, we were given a decision memorandum dated March 22, 198S, in
which the SSA reviewed and selected its long-term contingency plan for the backup and
recovery of its computer operations. Via this memorandum, SSA decided that it would
continue to provide for backup and recovery using a contractor-furnished hot site in lieu
of SSA-owned facilities. Interestingly, this decision stipulated that the SSA's selected
backup strategy would provide for minimum processing capacity rather than a full-capacity
backup. But this minimum requirement was described in the decision memorandum as
"those operations necessary for the agency to carry on its critical work, basically: processing
new claims, making postentitlement changes which affect the check, continuation of critical
payment procedures, and performing certain critical administrative and financial processes."
Clearly, SSA has also recognized the imperative to plan for full automation support of just
its critical functions in an emergency -- the issue is selecting which workloads are critical.
Database Integrity
The Social Security Administration should ensure the integrity of its databases
following a disaster because it may be impossible to restore a database that has
become incomplete and inaccurate.
In our deliberations on backup, one theme that kept emerging as a vital issue was
maintaining the integrity of the database. Following a loss of the NCC, the accuracy and
synchronization of SSA's database will be quickly jeopardized because a backlog of
transactions can accumulate beyond the agency's ability to assimilate and eventually process
them in an orderly fashion. Multiple changes to the same records and changes from
different sources can result in a loss of database currency and a backlog that will eventually
be irrecoverable. For example, if SSA has to revert to reading time tags to determine the
proper sequence of transaction processing, this could be a sign that the battle has been lost.
Of course, this potential problem can be averted if the programmatic functions are rapidly
and effectively restored and an untenable backlog avoided.
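The ordering hazard described above can be illustrated with a minimal sketch. It is not drawn from SSA's systems: the `Change` record, its `time_tag` field, and the sample values are hypothetical, chosen only to show that two changes to the same record produce different results depending on the order in which a backlog is applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    time_tag: int    # hypothetical capture time of the change
    record_id: str   # hypothetical record key
    field: str
    value: str


def apply_changes(changes):
    """Apply changes in the order given; a later write overwrites an earlier one."""
    db = {}
    for change in changes:
        db.setdefault(change.record_id, {})[change.field] = change.value
    return db


# Two sources report an address change for the same record, and the
# backlog delivers the older change after the newer one.
backlog = [
    Change(time_tag=2, record_id="A", field="address", value="new"),
    Change(time_tag=1, record_id="A", field="address", value="old"),
]

arrival_order = apply_changes(backlog)
time_tag_order = apply_changes(sorted(backlog, key=lambda c: c.time_tag))
```

In this sketch, applying the backlog in arrival order leaves the stale address in place, while sorting by time tag recovers the later one. Sorting works here only because every change carries a trustworthy tag; once changes arrive from several sources with a large backlog, that assumption is exactly what becomes hard to guarantee.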
In our phase 1 report, we recommended that the SSA develop an effective disaster
recovery plan, responsive to its needs. But, if such a plan is not in place, the undesirable
consequence may be to suspend or severely curtail operations so that the backlog is held
to manageable levels. In other words, SSA may be confronted with choosing between
"closing shop" or risking loss of database integrity following a disaster at the NCC. A data-
capture scenario that completes only the data-entry phase of transactions in a distributed
processing environment, or records the data on paper, until the NCC comes back on-line
would seem to be an attractive possibility. In fact, however, it can create very awkward
database recovery problems or make accurate database recovery impossible. It may even
accelerate the buildup of the deferred processing backlog.
Strategic Goals for Disaster Recovery
The Social Security Administration should explicitly identify the goals and
objectives that its disaster recovery plan must satisfy, because this will facilitate
systematic and defensible planning.
We recommend that SSA plan to achieve, as a minimum, the following two goals for
its disaster recovery plan:
Continuity of critical programmatic functions.
Maintenance of database integrity.
By critical functions, we mean a subset of all programmatic functions normally
performed (probably no more than half the normal functions). The underlying intention
is to spend minimally to support only the critical functions, with the assumption that the
mitigating circumstances of an emergency will permit and justify this reduction in service.
Most, if not all, of the financial, administrative, and software development functions should
be regarded as noncritical and may be suspended or severely curtailed. Major
programmatic functions such as adding or deleting a beneficiary should be continued
because they are vital to the agency's clients and have an impact on the trust fund.
Specifically, an effective strategy for backup and recovery should include the
following objectives:
To provide an appropriate level of protection and level of service.
To fit and build upon SSA's technical, operational, and business environment.
To satisfy realistic cost constraints.
To be implementable in a reasonable time frame.
To avoid risky technical designs.
Choosing an approach for backup and recovery is not unlike buying insurance. The
three major factors to consider and balance are:
What is at risk (e.g., replacement cost)?
What are the threats and their likelihood of occurrence?
What is the cost of various levels of protection?
This type of problem ultimately comes down to selecting an acceptable level of risk. It is
not a mathematically deterministic problem. Judgment must be applied and trade-offs
made. SSA can increase its level of protection while paying only for the protection (like
term insurance) by expanding its hot site provisions for capacity and communications.
Alternatively, it can increase its protection and also enhance ADP operations via a second
site arrangement (like whole life insurance). The choice is not black and white. Currently,
SSA is paying for and getting a less than adequate level of protection. We believe that this
is not prudent given what is at risk and the potential for disaster.
Alternatives
Three broad alternatives cover the range of backup and recovery strategies.
Our phase 1 report lists several alternatives available to SSA. Before reaching
consensus on a favored alternative that the agency should adopt for improved disaster
recovery, we considered the following major options:
Commercial Hot Site
The first and most pressing concern that must be addressed is what the SSA will do
immediately following a disaster. In the hot site alternative, SSA must plan to "bridge" the
initial period following a loss of the NCC until a more suitable facility can be acquired or
the NCC restored. As long as this initial period is adequately covered, we believe that it
will allow SSA the time to locate a cold site and acquire the hardware and communications
to equip it for supporting sustained future operations. We estimate that this initial period,
before a cold site can be brought up, should be no longer than 60 days in the current
market for such facilities. However, market conditions can change and the lead time for
acquiring hardware, communications, and a suitable facility cannot be assured or expected
to remain constant.
Choice of this alternative assumes in our thinking that the present commercial hot
site arrangement will be supplemented to incorporate emergency rerouting of
communications to it in order to assure support of the agency's 29,000 on-line terminals.
SSA Second Data Center
SSA can build a second data center. This alternative raises a number of questions,
including: what data will be kept there, how will it operate with the NCC, what processing
will it perform, what will its capacity be to assume all or some functions, how will it be
staffed, and how will it operate? There are at least two possible variations to this
alternative:
1. It will process only administrative, decision support, and software development
functions.
2. It will conduct full bicentralized operation with split database (e.g., by Social
Security number) and programmatic processing.
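The second variation, a database split by Social Security number with each site able to take over for the other, can be sketched as follows. This is only an illustration under stated assumptions: the site names, the last-digit parity rule, and the functions `home_site` and `route` are hypothetical and do not describe any SSA design.

```python
SITES = ("NCC", "second_center")  # hypothetical site names


def home_site(ssn: str) -> str:
    """Assign a record to a site by the parity of its last digit (assumed split rule)."""
    digits = [ch for ch in ssn if ch.isdigit()]
    if not digits:
        raise ValueError("number contains no digits")
    return SITES[int(digits[-1]) % 2]


def route(ssn: str, available: set) -> str:
    """Send work to the record's home site, or to the surviving site if the home is lost."""
    site = home_site(ssn)
    if site in available:
        return site
    survivors = [s for s in SITES if s in available]
    if not survivors:
        raise RuntimeError("no data center available")
    return survivors[0]
```

For example, `route("000-00-0001", {"NCC"})` sends a record whose home is the second center to the NCC when the second center is unavailable. The routing rule is the easy part; the sketch says nothing about how the surviving site would hold a current copy of the other site's data, which is the harder question raised above.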
Distributed Processing
SSA can employ expanded and distributed data and processing, for example, at the
Program Service Centers (PSCs) or locally at the district offices, to render the agency less
dependent on the NCC. This approach does not, however, make the SSA independent of
the NCC and its centralized databases and processing. We do not support this alternative
for disaster recovery. As an alternative information systems architecture, it has merit in the
long-term evolution of SSA's systems but will still require that an effective backup and
recovery approach be in place for the centralized databases.
Recently, we learned of SSA's "roll-down strategy" to relocate replaced NCC
mainframes to the six PSCs. This strategy has a bearing on the backup and recovery issue
and interjects a new set of planning considerations. For one, the new regional processing
centers also need their own disaster recovery plans and do not mitigate the need for backup
and recovery of the centralized databases. Also, the functions and data at the PSCs have
a bearing on the NCC functions to be restored and the facility required. We believe that
the roll-down strategy will not enhance backup and recovery because the regional centers
will not be capable of backing up the NCC's functions and data and will increase the
complexity of the problem because of the greater number of sites. Furthermore, we have
concerns regarding operational, management and control, and cost issues associated with
such a strategy, which are not the subject of this letter report but will be addressed in our
phase 2 report.
Planning Approach
The Social Security Administration should systematize its planning approach as
suggested below to provide a sound basis and justification for decisions.
Our role was necessarily limited to monitoring and reviewing new developments and
ideas, interacting with SSA's analysts and managers, and helping to facilitate a direction and
focus to SSA's disaster recovery planning. To date, this process has not progressed
sufficiently, and the agency is still groping with this issue and how best to approach it for
the long term. This should not be construed necessarily as a criticism of SSA's resources
or ability to do the job but more accurately as a consequence of the difficulty and
complexity of such decisions.
To facilitate further progress toward generating a workable disaster recovery plan
for the long term, we suggest that the SSA take the following actions now:
1. Determine the critical set of functions that are essential for survival and must
be performed following a disaster at the NCC. Typically, this is no more than
half of an organization's operational workload.
2. For the critical set of functions, determine what computing resources
(processing capacity, disk storage, and communications) are needed to
perform them. The intention is to identify a minimum technical facility that
the agency will need for disaster recovery.
3. Determine the time criticality of functions to be performed during an
emergency (e.g., reduced specifications on the levels of service) to determine
how long the agency can do without the critical functions being performed.
This will establish periodicity of computer runs and the time frame for
reestablishing operations. This is also related to the workload volumes for
the critical functions that will be encountered following a disaster, that is, how
fast a backlog is likely to accumulate and how large it can get before recovery
itself is jeopardized.
4. Determine whether or not the software can be partitioned during an
emergency to support a subset of functions or if all functions must continue
to be supported.
5. Set a maximum time frame of 12 months for continuing to operate with the
present disaster recovery plan.
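The sizing questions in steps 2 and 3 above lend themselves to simple arithmetic. The sketch below is an illustration only: the function names, the headroom factor, and any figures a reader supplies are hypothetical, and it assumes constant daily arrival and processing rates, which a real workload would not have.

```python
import math


def required_capacity(critical_workloads, headroom=1.2):
    """Sum the demand of the critical functions and add a margin (assumed 20 percent)."""
    return sum(critical_workloads.values()) * headroom


def days_until_backlog_limit(arrivals_per_day, processed_per_day, backlog_limit):
    """Days until the backlog reaches the limit; None if the backlog never grows."""
    growth = arrivals_per_day - processed_per_day
    if growth <= 0:
        return None
    return math.ceil(backlog_limit / growth)
```

With hypothetical figures of 100,000 transactions arriving per day, 60,000 processed per day, and a largest recoverable backlog of 1,000,000, `days_until_backlog_limit(100_000, 60_000, 1_000_000)` gives the number of days the agency could run in degraded mode before recovery itself is at risk. Both functions depend on first fixing the list of critical functions, which is why step 1 comes first.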
In addition, but of less immediacy, SSA needs to develop realistic capacity forecasts
(with a high degree of confidence), which the current estimates lack. It needs to consider
the overall system architecture that will exist in the near term and be prepared to
periodically reassess and adjust its disaster recovery plans as the system evolves. There are
other questions to address, such as: the correct periodicity and procedures for saving data
and programs off-site; the criteria for defining a disaster; the most important factors to
consider in making a decision on SSA's backup and recovery strategy; and how to weigh
cost against risk.
The actions listed above suggest the most immediate steps that SSA should take in
producing an appropriate disaster recovery plan. As its plans develop further, additional
details and actions will be required. This planning approach should also provide a sound
basis for the decisions reached as well as the underlying justification.
Summary Recommendation
Our preferred alternative is for the SSA to build a second data center to share
some of the processing load and to provide for limited recovery of operations at
either site.
Even though we appreciate that many factors are not yet deterministic and that
many details are unresolved, we believe on balance that a preferred choice is apparent.
We recommend that SSA adopt the second site approach to satisfy its backup and recovery
needs.
In this approach, SSA would establish a second data center and operate it with a
minimal support staff. We believe this second data center should be much smaller and
more modestly equipped than the NCC. A second site provides for improved operational
capacity as well as an appropriate level of backup and recovery. It will be available to
pick up workload from a troubled NCC and does not necessitate a go/no-go decision as
a commercial hot site does. There are many private and governmental entities that operate
multiple data centers to assure that the technical challenge is manageable. This approach
fits and builds upon SSA's centralized architecture without requiring a risky and costly
departure from it. It can readily support the agency's operational and business environment
because it does not impose changes on it. It can be implemented in a reasonable time
frame that is driven more by budget considerations than by technical difficulty or schedule
risk in development.
The second site approach is more costly than the present commercial hot site, but
the commercial hot site costs will increase when the essential communications are added.
We believe that the costs of a second site are not prohibitive, especially when considering
the effect of payment errors on the trust fund. In addition, operational benefits such as
improved systems response and greater system expandability can serve to offset some of the
increased costs.
Therefore, it is our consensus that the preferred approach for backup and recovery
and long-term operations is for the SSA to build a second data center to share its
processing burden and provide for limited recovery of operations at either site should the
other be lost for an extended period. Much still remains to be done, however, to determine
the associated technical and development details, the mode of operation of the second
center, and testing to assure that the critical workload can be operated from either site.
Willis H. Ware, Chairman
Committee on Review of the SSA's Systems
Modernization Plan and Agency Strategic
Plan