
---
created_at: '2017-10-08T01:26:46.000Z'
title: Personal Observations on Reliability of Shuttle (1986)
url: https://history.nasa.gov/rogersrep/v2appf.htm
author: michaelsbradley
points: 109
story_text: 
comment_text: 
num_comments: 15
story_id: 
story_title: 
story_url: 
parent_id: 
created_at_i: 1507426006
_tags:
- story
- author_michaelsbradley
- story_15426562
objectID: '15426562'
year: 1986

---
**Report of the PRESIDENTIAL COMMISSION on the Space Shuttle Challenger
Accident**


**Volume 2: Appendix F - Personal Observations on Reliability of
Shuttle**

by R. P. Feynman


**Introduction**

\[**F1**\] It appears that there are enormous differences of opinion as
to the probability of a failure with loss of vehicle and of human life.
The estimates range from roughly 1 in 100 to 1 in 100,000. The higher
figures come from the working engineers, and the very low figures from
management. What are the causes and consequences of this lack of
agreement? Since 1 part in 100,000 would imply that one could put a
Shuttle up each day for 300 years expecting to lose only one, we could
properly ask "What is the cause of management's fantastic faith in the
machinery?"
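The 300-year figure follows from simple arithmetic, which a short sketch can check. The function name and the assumption of exactly one launch per day, every day, are illustrative and not part of the report.

```python
def years_per_expected_loss(one_in, launches_per_year=365):
    """Years of flying at a steady launch rate per one expected loss.

    `one_in` is the N in a "1 in N" failure estimate.
    """
    return one_in / launches_per_year


# Management's figure: about 274 years of daily launches per expected loss,
# which the text rounds to 300.
management_years = years_per_expected_loss(100_000)

# The working engineers' figure: the same daily schedule would expect
# a loss roughly every 100 days.
engineer_days = years_per_expected_loss(100) * 365

print(f"1 in 100,000: {management_years:.0f} years per expected loss")
print(f"1 in 100:     {engineer_days:.0f} days per expected loss")
```

The two estimates differ by a factor of a thousand, which is the gap between centuries and months on the same launch schedule.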

We have also found that certification criteria used in Flight Readiness
Reviews often develop a gradually decreasing strictness. The argument
that the same risk was flown before without failure is often accepted as
an argument for the safety of accepting it again. Because of this,
obvious weaknesses are accepted again and again, sometimes without a
sufficiently serious attempt to remedy them, or to delay a flight
because of their continued presence.

There are several sources of information. There are published criteria
for certification, including a history of modifications in the form of
waivers and deviations. In addition, the records of the Flight Readiness
Reviews for each flight document the arguments used to accept the risks
of the flight. Information was obtained from the direct testimony and
the reports of the range safety officer, Louis J. Ullian, with respect
to the history of success of solid fuel rockets. There was a further
study by him (as chairman of the launch abort safety panel (LASP)) in an
attempt to determine the risks involved in possible accidents leading to
radioactive contamination from attempting to fly a plutonium power
supply (RTG) for future planetary missions. The NASA study of the same
question is also available. For the History of the Space Shuttle Main
Engines, interviews with management and engineers at Marshall, and
informal interviews with engineers at Rocketdyne, were made. An
independent (Cal Tech) mechanical engineer who consulted for NASA about
engines was also interviewed informally. A visit to Johnson was made to
gather information on the reliability of the avionics (computers,
sensors, and effectors). Finally there is a report "A Review of
Certification Practices, Potentially Applicable to Man-rated Reusable
Rocket Engines," prepared at the Jet Propulsion Laboratory by N. Moore,
et al., in February, 1986, for NASA Headquarters, Office of Space
Flight. It deals with the methods used by the FAA and the military to
certify their gas turbine and rocket engines. These authors were also
interviewed informally.


**Solid Fuel Rockets (SRB)**

An estimate of the reliability of solid rockets was made by the range
safety officer, by studying the experience of all previous rocket
flights. Out of a total of nearly 2,900 flights, 121 failed (1 in 25).
This includes, however, what may be called early errors: rockets flown
for the first few times, in which design errors are discovered and fixed.
A more reasonable figure for the mature rockets might be 1 in 50. With
special care in the selection of parts and in inspection, a figure of
below 1 in 100 might be achieved but 1 in 1,000 is probably not
attainable with today's technology. (Since there are two rockets on the
Shuttle, these rocket failure rates must be doubled to get Shuttle
failure rates from Solid Rocket Booster failure.)
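The doubling in the parenthetical can be compared with the exact figure for two boosters. This is a minimal sketch that assumes the two boosters fail independently, which the report does not state; the function name is hypothetical, and 2,900 stands in for "nearly 2,900".

```python
flights = 2900   # "nearly 2,900" in the text, so treat as approximate
failures = 121

historical_rate = failures / flights   # about 0.042 per rocket


def shuttle_failure_probability(per_rocket, boosters=2):
    """Chance that at least one booster fails, assuming independent failures."""
    return 1.0 - (1.0 - per_rocket) ** boosters


for label, p in [("historical", historical_rate),
                 ("mature rockets", 1 / 50),
                 ("with special care", 1 / 100)]:
    exact = shuttle_failure_probability(p)
    print(f"{label}: per rocket {p:.4f}, doubled {2 * p:.4f}, exact {exact:.4f}")
```

For rates this small the doubled figure and the exact figure differ only in the third or fourth decimal place, so simple doubling is a fair approximation.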

NASA officials argue that the figure is much lower. They point out that
these figures are for unmanned rockets but since the Shuttle is a manned
vehicle "the probability of mission success is necessarily very close to
1.0." It is not very clear what this phrase means. Does it mean it is
close to 1 or that it ought to be close to 1? They go on to explain
"Historically this extremely high degree of mission success has given
rise to a difference in philosophy between manned space flight programs
and unmanned programs; i.e., numerical probability usage versus
engineering judgment." (These quotations are from "Space Shuttle Data
for Planetary Mission RTG Safety Analysis," Pages 3-1, 3-2, February 15,
1985, NASA, JSC.) It is true that if the probability of failure was as
low as 1 in 100,000 it would take an inordinate number of tests to
determine it (you would get nothing but a string of perfect flights,
from which no precise figure could be drawn, other than that the
probability is likely less than 1 over the number of such flights in the
string so far). But, if the real probability is not so small, flights
would show troubles, near failures, and possible actual failures with a
reasonable number of trials, and standard statistical methods could give
a reasonable
estimate. In fact, previous NASA experience had shown, on occasion, just
such difficulties, near accidents, and accidents, all giving warning
that the probability of flight failure was not so very small. The
inconsistency of the argument not to determine reliability through
historical experience, as the range safety officer did, is that NASA
also appeals to history, beginning "Historically this high degree of
mission success..." Finally, if we are to replace standard numerical
probability usage with engineering judgment, why do we find such an
enormous disparity between the management estimate and the judgment of
the engineers? It would appear that, for whatever purpose, be it for
internal or external consumption, the management of NASA exaggerates the
reliability of its product, to the point of fantasy.
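How little a string of perfect flights can prove is easy to quantify. The sketch below uses a standard textbook bound for a run of successes (solve (1 - p)^n = 1 - confidence for p); it is not a method taken from the report, and the 24-flight run and the 95% confidence level are illustrative choices.

```python
import math


def upper_bound_after_clean_run(n_flights, confidence=0.95):
    """Largest per-flight failure probability still consistent, at the
    given confidence, with n_flights consecutive successes."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_flights)


def clean_flights_needed(target_probability, confidence=0.95):
    """Consecutive successes needed before the bound drops to the target."""
    return math.ceil(math.log(1.0 - confidence)
                     / math.log(1.0 - target_probability))


# A run of 24 clean flights only bounds the failure probability near 12%.
print(f"bound after 24 clean flights: {upper_bound_after_clean_run(24):.3f}")

# Supporting a claim of 1 in 100,000 this way takes hundreds of thousands
# of consecutive successes.
print(f"clean flights needed for 1 in 100,000: {clean_flights_needed(1 / 100_000)}")
```

On these assumptions a short flight history cannot support the management figure, while it is fully compatible with the engineers' estimate of about 1 in 100.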

The history of the certification and Flight Readiness Reviews will not
be repeated here. (See other part of Commission reports.) The phenomenon
of accepting for flight, seals that had shown erosion and blow-by in
previous flights, is very clear. The Challenger flight is an excellent
example. There are several references to flights that had gone before.
The acceptance and success of these flights is taken as evidence of
safety. But erosion and blow-by are not what the design expected. They
are warnings that something is wrong. The equipment is not operating as
expected, and therefore there is a danger that it can operate with even
wider deviations in this unexpected and not thoroughly understood way.
The fact that this danger did not lead to a catastrophe before is no
guarantee that it will not the next time, unless it is completely
understood. When playing Russian roulette the fact that the first shot
got off safely is little comfort for the next. The origin and
consequences of the erosion and blow-by were not understood. They did
not occur equally on all flights and all joints; sometimes more, and
sometimes less. Why not sometime, when whatever conditions determined it
were right, still more, leading to catastrophe?

In spite of these variations from case to case, officials behaved as if
they understood it, giving apparently logical arguments to each other
often depending on the "success" of previous flights. For example, in
determining if flight 51-L was safe to fly in the face of ring erosion
in flight 51-C, it was noted that the erosion depth was only one-third
of the radius. It had been noted in an \[**F2**\] experiment cutting the
ring that cutting it as deep as one radius was necessary before the ring
failed. Instead of being very concerned that variations of poorly
understood conditions might reasonably create a deeper erosion this
time, it was asserted, there was "a safety factor of three." This is a
strange use of the engineer's term, "safety factor." If a bridge is
built to withstand a certain load without the beams permanently
deforming, cracking, or breaking, it may be designed for the materials
used to actually stand up under three times the load. This "safety
factor" is to allow for uncertain excesses of load, or unknown extra
loads, or weaknesses in the material that might have unexpected flaws,
etc. If now the expected load comes on to the new bridge and a crack
appears in a beam, this is a failure of the design. There was no safety
factor at all; even though the bridge did not actually collapse because
the crack went only one-third of the way through the beam. The O-rings
of the Solid Rocket Boosters were not designed to erode. Erosion was a
clue that something was wrong. Erosion was not something from which
safety can be inferred.

There was no way, without full understanding, that one could have
confidence that conditions the next time might not produce erosion three
times more severe than the time before. Nevertheless, officials fooled
themselves into thinking they had such understanding and confidence, in
spite of the peculiar variations from case to case. A mathematical model
was made to calculate erosion. This was a model based not on physical
understanding but on empirical curve fitting. To be more detailed, it
was supposed a stream of hot gas impinged on the O-ring material, and
the heat was determined at the point of stagnation (so far, with
reasonable physical, thermodynamic laws). But to determine how much
rubber eroded it was assumed this depended only on this heat by a
formula suggested by data on a similar material. A logarithmic plot
suggested a straight line, so it was supposed that the erosion varied as
the .58 power of the heat, the .58 being determined by a nearest fit. At
any rate, adjusting some other numbers, it was determined that the model
agreed with the erosion (to depth of one-third the radius of the ring).
There is nothing much so wrong with this as believing the answer\!
Uncertainties appear everywhere. How strong the gas stream might be was
unpredictable; it depended on holes formed in the putty. Blow-by showed
that the ring might fail even though not, or only partially, eroded
through. The empirical formula was known to be uncertain, for it did not
go directly through the very data points by which it was determined.
There was a cloud of points, some twice above and some twice below the
fitted curve, so erosions twice those predicted were reasonable from that
cause alone. Similar uncertainties surrounded the other constants in the
formula, etc., etc. When using a mathematical model careful attention
must be given to uncertainties in the model.
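The curve-fitting step and its uncertainty can be illustrated with a small sketch. The data below are synthetic and are not the O-ring measurements: they are generated from a 0.58 power law with every point pushed a factor of two above or below the curve, to mimic the scatter the text describes. The function names are hypothetical.

```python
import math


def fit_power_law(heat, erosion):
    """Least-squares straight line on a log-log plot: erosion ~ a * heat**b."""
    xs = [math.log(h) for h in heat]
    ys = [math.log(e) for e in erosion]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def scatter_factor(heat, erosion, a, b):
    """Largest factor by which any point lies above or below the fitted curve."""
    worst = 1.0
    for h, e in zip(heat, erosion):
        ratio = e / (a * h ** b)
        worst = max(worst, ratio, 1.0 / ratio)
    return worst


# Synthetic data, NOT the O-ring measurements: a 0.58 power law with each
# point pushed a factor of two above or below the curve.
heat = [2.0 ** i for i in range(8)]
offsets = [0.5, 2.0, 2.0, 0.5, 2.0, 0.5, 0.5, 2.0]
erosion = [m * h ** 0.58 for m, h in zip(offsets, heat)]

a, b = fit_power_law(heat, erosion)
spread = scatter_factor(heat, erosion, a, b)
print(f"fitted exponent {b:.2f}, scatter about the curve: factor of {spread:.1f}")
```

The fit recovers a tidy exponent even though no single point lies on the curve. Reporting only the fitted line hides the factor-of-two spread, which is the uncertainty that matters when the prediction is compared with a failure threshold.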


**Liquid Fuel Engine (SSME)**

During the flight of 51-L the three Space Shuttle Main Engines all
worked perfectly, even, at the last moment, beginning to shut down the
engines as the fuel supply began to fail. The question arises, however,
as to whether, had it failed, and we were to investigate it in as much
detail as we did the Solid Rocket Booster, we would find a similar lack
of attention to faults and a deteriorating reliability. In other words,
were the organization weaknesses that contributed to the accident
confined to the Solid Rocket Booster sector or were they a more general
characteristic of NASA? To that end the Space Shuttle Main Engines and
the avionics were both investigated. No similar study of the Orbiter or
the External Tank was made.

The engine is a much more complicated structure than the Solid Rocket
Booster, and a great deal more detailed engineering goes into it.
Generally, the engineering seems to be of high quality and apparently
considerable attention is paid to deficiencies and faults found in
operation.

The usual way that such engines are designed (for military or civilian
aircraft) may be called the component system, or bottom-up design. First
it is necessary to thoroughly understand the properties and limitations
of the materials to be used (for turbine blades, for example), and tests
are begun in experimental rigs to determine those. With this knowledge
larger component parts (such as bearings) are designed and tested
individually. As deficiencies and design errors are noted they are
corrected and verified with further testing. Since one tests only parts
at a time these tests and modifications are not overly expensive.
Finally one works up to the final design of the entire engine, to the
necessary specifications. There is a good chance, by this time that the
engine will generally succeed, or that any failures are easily isolated
and analyzed because the failure modes, limitations of materials, etc.,
are so well understood. There is a very good chance that the
modifications to the engine to get around the final difficulties are not
very hard to make, for most of the serious problems have already been
discovered and dealt with in the earlier, less expensive, stages of the
process.

The Space Shuttle Main Engine was handled in a different manner, top
down, we might say. The engine was designed and put together all at once
with relatively little detailed preliminary study of the material and
components. Then when troubles are found in the bearings, turbine
blades, coolant pipes, etc., it is more expensive and difficult to
discover the causes and make changes. For example, cracks have been
found in the turbine blades of the high pressure oxygen turbopump. Are
they caused by flaws in the material, the effect of the oxygen
atmosphere on the properties of the material, the thermal stresses of
startup or shutdown, the vibration and stresses of steady running, or
mainly at some resonance at certain speeds, etc.? How long can we run
from crack initiation to crack failure, and how does this depend on
power level? Using the completed engine as a test bed to resolve such
questions is extremely expensive. One does not wish to lose an entire
engine in order to find out where and how failure occurs. Yet, an
accurate knowledge of this information is essential to acquire a
confidence in the engine reliability in use. Without detailed
understanding, confidence can not be attained.

A further disadvantage of the top-down method is that, if an
understanding of a fault is obtained, a simple fix, such as a new shape
for the turbine housing, may be impossible to implement without a
redesign of the entire engine.

The Space Shuttle Main Engine is a very remarkable machine. It has a
greater ratio of thrust to weight than any previous engine. It is built
at the edge of, or outside of, previous engineering experience.
Therefore, as expected, many different kinds of flaws and difficulties
have turned up. Because, unfortunately, it was built in the top-down
manner, they are difficult to find and fix. The design aim of a lifetime
of 55 mission equivalent firings (27,000 seconds of operation, either
in a mission of 500 seconds, or on a test stand) has not been obtained.
The engine now requires very frequent maintenance and replacement of
important parts, such as turbopumps, bearings, sheet metal housings,
etc. The high-pressure fuel turbopump had to be replaced every three or
four mission equivalents (although that may have been fixed, now) and
the high pressure oxygen turbopump every five or six. This is at most
ten percent of the original specification. But our main concern here is
the determination of reliability.

In a total of about 250,000 seconds of operation, the engines have
failed seriously perhaps 16 times. Engineering pays close attention to
these failings and tries to remedy them as quickly as possible. This it
does by test studies on special rigs experimentally designed for the
flaws in question, by careful inspection of the engine for suggestive
clues (like cracks), and by considerable study and analysis. In this
way, in spite of the difficulties of top-down design, through hard work,
many of the problems have apparently been solved.

\[**F3**\] A list of some of the problems follows. Those followed by an
asterisk (\*) are probably solved:

  - Turbine blade cracks in high pressure fuel turbopumps (HPFTP). (May
    have been solved.)
  - Turbine blade cracks in high pressure oxygen turbopumps (HPOTP).
  - Augmented Spark Igniter (ASI) line rupture.\*
  - Purge check valve failure.\*
  - ASI chamber erosion.\*
  - HPFTP turbine sheet metal cracking.
  - HPFTP coolant liner failure.\*
  - Main combustion chamber outlet elbow failure.\*
  - Main combustion chamber inlet elbow weld offset.\*
  - HPOTP subsynchronous whirl.\*
  - Flight acceleration safety cutoff system (partial failure in a
    redundant system).\*
  - Bearing spalling (partially solved).
  - A vibration at 4,000 Hertz making some engines inoperable, etc.

Many of these solved problems are the early difficulties of a new
design, for 13 of them occurred in the first 125,000 seconds and only
three in the second 125,000 seconds. Naturally, one can never be sure
that all the bugs are out, and, for some, the fix may not have addressed
the true cause. Thus, it is not unreasonable to guess there may be at
least one surprise in the next 250,000 seconds, a probability of 1/500
per engine per mission. On a mission there are three engines, but some
accidents would possibly be contained, and only affect one engine. The
system can abort with only two engines. Therefore let us say that the
unknown surprises do not, even of themselves, permit us to guess that the
probability of mission failure due to the Space Shuttle Main Engine is
less than 1/500. To this we must add the chance of failure from known,
but as yet unsolved, problems (those without the asterisk in the list
above). These we discuss below. (Engineers at Rocketdyne, the
manufacturer, estimate the total probability as 1/10,000. Engineers at
Marshall estimate it as 1/300, while NASA management, to whom these
engineers report, claims it is 1/100,000. An independent engineer
consulting for NASA thought 1 or 2 per 100 a reasonable estimate.)
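The 1/500 figure comes from converting run time into mission equivalents. The sketch below takes the 500-second mission duration from earlier in this section; the function name is hypothetical, and treating every serious failure as one count per mission equivalent is a simplification for illustration.

```python
MISSION_SECONDS = 500   # engine firing time per mission, from the text


def per_mission_probability(failures, seconds_of_operation,
                            mission_seconds=MISSION_SECONDS):
    """Failures per engine per mission, given a count over a span of run time."""
    mission_equivalents = seconds_of_operation / mission_seconds
    return failures / mission_equivalents


first_half = per_mission_probability(13, 125_000)   # early difficulties
second_half = per_mission_probability(3, 125_000)
surprise = per_mission_probability(1, 250_000)      # the guessed "one surprise"

print(f"first 125,000 s:  1 in {1 / first_half:.0f}")
print(f"second 125,000 s: 1 in {1 / second_half:.0f}")
print(f"one surprise in the next 250,000 s: 1 in {1 / surprise:.0f}")
```

The historical rows count every serious failure on the test stand or in flight, not only those that would have lost a mission, so they are not directly comparable with the quoted mission-failure estimates.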

The history of the certification principles for these engines is
confusing and difficult to explain. Initially the rule seems to have
been that two sample engines must each have had twice the time operating
without failure as the operating time of the engine to be certified
(rule of 2x). At least that is the FAA practice, and NASA seems to have
adopted it, originally expecting the certified time to be 10 missions
(hence 20 missions for each sample). Obviously the best engines to use
for comparison would be those of greatest total (flight plus test)
operating time -- the so-called "fleet leaders." But what if a third
sample and several others fail in a short time? Surely we will not be
safe because two were unusual in lasting longer. The short time might be
more representative of the real possibilities, and in the spirit of the
safety factor of 2, we should only operate at half the time of the
short-lived samples.
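The difference between the two readings of the 2x rule can be made concrete. In this sketch the sample run times are hypothetical numbers in mission equivalents, and both function names are illustrative.

```python
def certified_time_two_best(sample_times):
    """Rule of 2x applied to the two longest-running samples (fleet leaders)."""
    best_two = sorted(sample_times, reverse=True)[:2]
    return min(best_two) / 2


def certified_time_all_samples(sample_times):
    """Rule of 2x applied to the shortest-lived sample, as the text suggests."""
    return min(sample_times) / 2


# Hypothetical fleet: two long-lived engines and two that failed early.
samples = [22, 20, 6, 5]

print(certified_time_two_best(samples))      # 10.0 mission equivalents
print(certified_time_all_samples(samples))   # 2.5 mission equivalents
```

With these made-up numbers, ignoring the short-lived samples certifies four times as much operating time as the conservative reading allows.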

The slow shift toward decreasing safety factor can be seen in many
examples. We take that of the HPFTP turbine blades. First of all the
idea of testing an entire engine was abandoned. Each engine number has
had many important parts (like the turbopumps themselves) replaced at
frequent intervals, so that the rule must be shifted from engines to
components. We accept an HPFTP for a certification time if two samples
have each run successfully for twice that time (and of course, as a
practical matter, no longer insisting that this time be as large as 10
missions). But what is "successfully?" The FAA calls a turbine blade
crack a failure, in order, in practice, to really provide a safety
factor greater than 2. There is some time that an engine can run between
the time a crack originally starts until the time it has grown large
enough to fracture. (The FAA is contemplating new rules that take this
extra safety time into account, but only if it is very carefully
analyzed through known models within a known range of experience and
with materials thoroughly tested. None of these conditions apply to the
Space Shuttle Main Engine.)

Cracks were found in many second stage HPFTP turbine blades. In one case
three were found after 1,900 seconds, while in another they were not
found after 4,200 seconds, although usually these longer runs showed
cracks. To follow this story further we shall have to realize that the
stress depends a great deal on the power level. The Challenger flight
was to be at, and previous flights had been at, a power level called
104% of rated power level during most of the time the engines were
operating. Judging from some material data it is supposed that at the
level 104% of rated power level, the time to crack is about twice that
at 109% or full power level (FPL). Future flights were to be at this
level because of heavier payloads, and many tests were made at this
level. Therefore dividing time at 104% by 2, we obtain units called
equivalent full power level (EFPL). (Obviously, some uncertainty is
introduced by that, but it has not been studied.) The earliest cracks
mentioned above occurred at 1,375 EFPL.
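The EFPL conversion can be written out as a one-line rule. The text states only that time at 104% is divided by 2; counting time at full power level at full weight is an assumption of this sketch, and the 1,050/850 split is hypothetical, since the text does not say how the 1,900 seconds were divided between power levels.

```python
def efpl_seconds(seconds_at_104, seconds_at_fpl=0.0):
    """Equivalent full power level seconds.

    Time at 104% counts half, as the text describes. Counting time at
    full power level (109%) at full weight is an assumption of this sketch.
    """
    return seconds_at_104 / 2.0 + seconds_at_fpl


print(efpl_seconds(1900))        # 950.0 if the whole run were at 104%
print(efpl_seconds(1050, 850))   # 1375.0 for a hypothetical mixed run
```

The factor of 2 is itself an estimate drawn from material data, so any error in it carries straight through to every EFPL figure.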

Now the certification rule becomes "limit all second stage blades to a
maximum of 1,375 seconds EFPL." If one objects that the safety factor of
2 is lost it is pointed out that the one turbine ran for 3,800 seconds
EFPL without cracks, and half of this is 1,900 so we are being more
conservative. We have fooled ourselves in three ways. First we have only
one sample, and it is not the fleet leader, for the other two samples of
3,800 or more seconds had 17 cracked blades between them. (There are 59
blades in the engine.) Next we have abandoned the 2x rule and
substituted equal time. And finally, 1,375 is where we did see a crack.
We can say that no crack had been found below 1,375, but the last time
we looked and saw no cracks was 1,100 seconds EFPL. We do not know when
the crack formed between these times, for example cracks may have formed
at 1,150 seconds EFPL. (Approximately 2/3 of the blade sets tested in
excess of 1,375 seconds EFPL had cracks. Some recent experiments have,
indeed, shown cracks as early as 1,150 seconds.) It was important to
keep the number high, for the Challenger was to fly an engine very close
to the limit by the time the flight was over.
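The gap between the adopted limit and what a 2x rule would give can be sketched as follows. Applying the safety factor to the earliest time at which the crack could have formed is one possible reading of the rule, not the procedure the report attributes to anyone; the function names are hypothetical.

```python
def onset_window(last_clean_inspection, first_cracked_inspection):
    """The crack formed somewhere after the last clean look and no later
    than the inspection that found it."""
    return last_clean_inspection, first_cracked_inspection


def limit_under_2x_rule(earliest_possible_onset, factor=2.0):
    """Operating limit if the safety factor is applied to the earliest
    time at which a crack could have been present."""
    return earliest_possible_onset / factor


earliest, latest = onset_window(1100, 1375)
adopted_limit = 1375
conservative_limit = limit_under_2x_rule(earliest)

print(f"crack formed between {earliest} and {latest} seconds EFPL")
print(f"adopted limit: {adopted_limit} s; 2x rule on earliest onset: "
      f"{conservative_limit:.0f} s")
```

On this reading the adopted limit is two and a half times what the original rule would have allowed.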

Finally it is claimed that the criteria are not abandoned, and the
system is safe, by giving up the FAA convention that there should be no
cracks, and considering only a completely fractured blade a failure.
With this definition no engine has yet failed. The idea is that since
there is sufficient time for a crack to grow to a fracture we can ensure
that all is safe by inspecting all blades for cracks. If they are found,
replace them, and if none are found we have enough time for a safe
mission. This makes the crack problem not a flight safety problem, but
merely a maintenance problem.

This may in fact be true. But how well do we know that cracks always
grow slowly enough that no fracture can occur in a mission? Three
engines have run for long times with a few cracked blades (about 3,000
seconds EFPL) with no blades broken off.

But a fix for this cracking may have been found. By changing the blade
shape, shot-peening the surface, and covering with insulation to exclude
thermal shock, the blades have not cracked so far.

A very similar story appears in the history of certification of the
HPOTP, but we shall not give the details here.

It is evident, in summary, that the Flight Readiness Reviews and
certification rules show a deterioration for some of the problems of the
Space Shuttle Main Engine that is closely analogous to the deterioration
seen in the rules for the Solid Rocket Booster.


**Avionics**

By "avionics" is meant the computer system on the Orbiter as well as its
input sensors and output actuators. At first we will restrict ourselves
to the computers proper and not be concerned with the reliability of the
input information from the sensors of \[**F4**\] temperature, pressure,
etc., nor with whether the computer output is faithfully followed by the
actuators of rocket firings, mechanical controls, displays to
astronauts, etc.

The computer system is very elaborate, having over 250,000 lines of
code. It is responsible, among many other things, for the automatic
control of the entire ascent to orbit, and for the descent until well
into the atmosphere (below Mach 1) once one button is pushed deciding
the landing site desired. It would be possible to make the entire
landing automatically (except that the landing gear lowering signal is
expressly left out of computer control, and must be provided by the
pilot, ostensibly for safety reasons) but such an entirely automatic
landing is probably not as safe as a pilot controlled landing. During
orbital flight it is used in the control of payloads, in displaying
information to the astronauts, and the exchange of information to the
ground. It is evident that the safety of flight requires guaranteed
accuracy of this elaborate system of computer hardware and software.

In brief, the hardware reliability is ensured by having four essentially
independent identical computer systems. Where possible each sensor also
has multiple copies, usually four, and each copy feeds all four of the
computer lines. If the inputs from the sensors disagree, depending on
circumstances, certain averages, or a majority selection is used as the
effective input. The algorithm used by each of the four computers is
exactly the same, so their inputs (since each sees all copies of the
sensors) are the same. Therefore at each step the results in each
computer should be identical. From time to time they are compared, but
because they might operate at slightly different speeds a system of
stopping and waiting at specific times is instituted before each
comparison is made. If one of the computers disagrees, or is too late in
having its answer ready, the three which do agree are assumed to be
correct and the errant computer is taken completely out of the system.
If, now, another computer fails, as judged by the agreement of the other
two, it is taken out of the system, and the rest of the flight canceled,
and descent to the landing site is instituted, controlled by the two
remaining computers. It is seen that this is a redundant system since
the failure of only one computer does not affect the mission. Finally,
as an extra feature of safety, there is a fifth independent computer,
whose memory is loaded with only the programs of ascent and descent, and
which is capable of controlling the descent if there is a failure of
more than two of the computers of the main line four.
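The comparison-and-removal scheme described above can be sketched in a few lines. This is an illustration of the voting logic only, not the flight software: the function names, the use of `None` for a late answer, and the status strings are all assumptions.

```python
from collections import Counter


def compare_step(outputs):
    """Compare results at a synchronization point.

    `outputs` maps computer id -> result, with None standing for an answer
    that was not ready in time. Returns (agreed_value, errant_ids); the
    agreed value is None when no strict majority exists.
    """
    counts = Counter(r for r in outputs.values() if r is not None)
    if counts:
        value, votes = counts.most_common(1)[0]
        if 2 * votes > len(outputs):
            errant = sorted(c for c, r in outputs.items() if r != value)
            return value, errant
    return None, []


def mission_status(active_count):
    """What the text says happens with this many main-line computers left."""
    if active_count >= 3:
        return "continue mission"
    if active_count == 2:
        return "cancel flight, descend on two computers"
    return "hand control to the fifth (backup) computer"


# One computer of four disagrees: it is removed and the mission continues.
value, errant = compare_step({1: 42, 2: 42, 3: 42, 4: 41})
print(value, errant, mission_status(4 - len(errant)))

# Later a second computer is late with its answer: two remain, so descend.
value2, errant2 = compare_step({1: 7, 2: None, 3: 7})
print(value2, errant2, mission_status(3 - len(errant2)))
```

Once only two computers remain, a disagreement between them has no majority to resolve it, which is one way to see why the text gives the fifth computer its own independent role.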

There is not enough room in the memory of the main line computers for
all the programs of ascent, descent, and payload programs in flight, so
the memory is loaded about four times from tapes, by the astronauts.

Because of the enormous effort required to replace the software for such
an elaborate system, and for checking a new system out, no change has
been made to the hardware since the system began about fifteen years
ago. The actual hardware is obsolete; for example, the memories are of
the old ferrite core type. It is becoming more difficult to find
manufacturers to supply such old-fashioned computers reliably and of
high quality. Modern computers are very much more reliable, can run much
faster, simplifying circuits, and allowing more to be done, and would
not require so much loading of memory, for the memories are much larger.

The software is checked very carefully in a bottom-up fashion. First,
each new line of code is checked, then sections of code or modules with
special functions are verified. The scope is increased step by step
until the new changes are incorporated into a complete system and
checked. This complete output is considered the final product, newly
released. But completely independently there is an independent
verification group, that takes an adversary attitude to the software
development group, and tests and verifies the software as if it were a
customer of the delivered product. There is additional verification in
using the new programs in simulators, etc. A discovery of an error
during verification testing is considered very serious, and its origin
studied very carefully to avoid such mistakes in the future. Such
unexpected errors have been found only about six times in all the
programming and program changing (for new or altered payloads) that has
been done. The principle that is followed is that all the verification
is not an aspect of program safety, it is merely a test of that safety,
in a non-catastrophic verification. Flight safety is to be judged solely
on how well the programs do in the verification tests. A failure here
generates considerable concern.

To summarize then, the computer software checking system and attitude is
of the highest quality. There appears to be no process of gradually
fooling oneself while degrading standards so characteristic of the Solid
Rocket Booster or Space Shuttle Main Engine safety systems. To be sure,
there have been recent suggestions by management to curtail such
elaborate and expensive tests as being unnecessary at this late date in
Shuttle history. This must be resisted for it does not appreciate the
mutual subtle influences, and sources of error generated by even small
changes of one part of a program on another. There are perpetual
requests for changes as new payloads and new demands and modifications
are suggested by the users. Changes are expensive because they require
extensive testing. The proper way to save money is to curtail the number
of requested changes, not the quality of testing for each.

One might add that the elaborate system could be very much improved by
more modern hardware and programming techniques. Any outside competition
would have all the advantages of starting over, and whether that is a
good idea for NASA now should be carefully considered.

Finally, returning to the sensors and actuators of the avionics system,
we find that the attitude to system failure and reliability is not
nearly as good as for the computer system. For example, a difficulty was
found with certain temperature sensors sometimes failing. Yet 18 months
later the same sensors were still being used, still sometimes failing,
until a launch had to be scrubbed because two of them failed at the same
time. Even on a succeeding flight this unreliable sensor was used again.
Again, reaction control systems, the rocket jets used for reorienting and
control in flight, still are somewhat unreliable. There is considerable
redundancy, but a long history of failures, none of which has yet been
extensive enough to seriously affect flight. The action of the jets is
checked by sensors, and, if they fail to fire, the computers choose
another jet to fire. But they are not designed to fail, and the problem
should be solved.


**Conclusions**

If a reasonable launch schedule is to be maintained, engineering often
cannot be done fast enough to keep up with the expectations of
originally conservative certification criteria designed to guarantee a
very safe vehicle. In these situations, subtly, and often with
apparently logical arguments, the criteria are altered so that flights
may still be certified in time. They therefore fly in a relatively
unsafe condition, with a chance of failure of the order of a percent (it
is difficult to be more accurate).

Official management, on the other hand, claims to believe the
probability of failure is a thousand times less. One reason for this may
be an attempt to assure the government of NASA perfection and success in
order to ensure the supply of funds. The other may be that they
sincerely believed it to be true, demonstrating an almost incredible
lack of communication between themselves and their working engineers.

In any event this has had very unfortunate consequences, the most
serious of which is to encourage ordinary citizens to fly in such a
dangerous machine, as if it had attained the safety of an ordinary
airliner. The astronauts, like test pilots, should know their risks, and
we honor them for their courage. Who can doubt that McAuliffe was
equally a person of great courage, who was closer to an awareness of the
true risk than NASA management would have us believe?

\[**F5**\] Let us make recommendations to ensure that NASA officials
deal in a world of reality in understanding technological weaknesses and
imperfections well enough to be actively trying to eliminate them. They
must live in reality in comparing the costs and utility of the Shuttle
to other methods of entering space. And they must be realistic in making
contracts and in estimating costs and the difficulty of the projects. Only
realistic flight schedules should be proposed, schedules that have a
reasonable chance of being met. If in this way the government would not
support them, then so be it. NASA owes it to the citizens from whom it
asks support to be frank, honest, and informative, so that these
citizens can make the wisest decisions for the use of their limited
resources.

For a successful technology, reality must take precedence over public
relations, for nature cannot be fooled.
-												Tufte CSS, upgrade jekyll, add metadata, index

											
										
										
											2018-02-23 18:58:03 +00:00
+								---
 								created_at: '2017-10-08T01:26:46.000Z'
 								title: Personal Observations on Reliability of Shuttle (1986)
 								url: https://history.nasa.gov/rogersrep/v2appf.htm
 								author: michaelsbradley
 								points: 109
 								story_text:
 								comment_text:
 								num_comments: 15
 								story_id:
 								story_title:
 								story_url:
 								parent_id:
 								created_at_i: 1507426006
 								_tags:
 								- story
 								- author_michaelsbradley
 								- story_15426562
 								objectID: '15426562'
-												Adds year as metadata to all stories

											
										
										
											2018-06-08 12:05:27 +00:00
+								year: 1986
-												Tufte CSS, upgrade jekyll, add metadata, index

											
										
										
											2018-02-23 18:58:03 +00:00
 								---
-												New parser with newspaper + pandoc

											
										
										
											2018-03-03 09:35:28 +00:00
+								**Report of the PRESIDENTIAL COMMISSION on the Space Shuttle Challenger
 								Accident**
-												Initial commit

											
										
										
											2018-02-23 18:19:40 +00:00
-												New parser with newspaper + pandoc

											
										
										
											2018-03-03 09:35:28 +00:00
+								** **
-												Initial commit

											
										
										
											2018-02-23 18:19:40 +00:00
-												New parser with newspaper + pandoc

											
										
										
											2018-03-03 09:35:28 +00:00
+								**Volume 2: Appendix F - Personal Observations on Reliability of
 								Shuttle**
-												Initial commit

											
										
										
											2018-02-23 18:19:40 +00:00
-												New parser with newspaper + pandoc

											
										
										
											2018-03-03 09:35:28 +00:00
+								by R. P. Feynman
 								** **
 								** **
 								**Introduction**
 								\[**F1**\] It appears that there are enormous differences of opinion as
 								to the probability of a failure with loss of vehicle and of human life.
 								The estimates range from roughly 1 in 100 to 1 in 100,000. The higher
 								figures come from the working engineers, and the very low figures from
 								management. What are the causes and consequences of this lack of
 								agreement? Since 1 part in 100,000 would imply that one could put a
 								Shuttle up each day for 300 years expecting to lose only one, we could
 								properly ask "What is the cause of management's fantastic faith in the
 								machinery?"
 								We have also found that certification criteria used in Flight Readiness
 								Reviews often develop a gradually decreasing strictness. The argument
 								that the same risk was flown before without failure is often accepted as
 								an argument for the safety of accepting it again. Because of this,
 								obvious weaknesses are accepted again and again, sometimes without a
 								sufficiently serious attempt to remedy them, or to delay a flight
 								because of their continued presence.
 								There are several sources of information. There are published criteria
 								for certification, including a history of modifications in the form of
 								waivers and deviations. In addition, the records of the Flight Readiness
 								Reviews for each flight document the arguments used to accept the risks
 								of the flight. Information was obtained from the direct testimony and
 								the reports of the range safety officer, Louis J. Ullian, with respect
 								to the history of success of solid fuel rockets. There was a further
 								study by him (as chairman of the launch abort safety panel (LASP)) in an
 								attempt to determine the risks involved in possible accidents leading to
 								radioactive contamination from attempting to fly a plutonium power
 								supply (RTG) for future planetary missions. The NASA study of the same
 								question is also available. For the History of the Space Shuttle Main
 								Engines, interviews with management and engineers at Marshall, and
 								informal interviews with engineers at Rocketdyne, were made. An
 								independent (Cal Tech) mechanical engineer who consulted for NASA about
 								engines was also interviewed informally. A visit to Johnson was made to
 								gather information on the reliability of the avionics (computers,
 								sensors, and effectors). Finally there is a report "A Review of
 								Certification Practices, Potentially Applicable to Man-rated Reusable
 								Rocket Engines," prepared at the Jet Propulsion Laboratory by N. Moore,
 								et al., in February, 1986, for NASA Headquarters, Office of Space
 								Flight. It deals with the methods used by the FAA and the military to
 								certify their gas turbine and rocket engines. These authors were also
 								interviewed informally.
 								** **
 								**Solid Fuel Rockets (SRB)**
 								An estimate of the reliability of solid rockets was made by the range
 								safety officer, by studying the experience of all previous rocket
 								flights. Out of a total of nearly 2,900 flights, 121 failed (1 in 25).
 								This includes, however, what may be called, early errors, rockets flown
 								for the first few times in which design errors are discovered and fixed.
 								A more reasonable figure for the mature rockets might be 1 in 50. With
 								special care in the selection of parts and in inspection, a figure of
 								below 1 in 100 might be achieved but 1 in 1,000 is probably not
 								attainable with today's technology. (Since there are two rockets on the
 								Shuttle, these rocket failure rates must be doubled to get Shuttle
 								failure rates from Solid Rocket Booster failure.)
 								NASA officials argue that the figure is much lower. They point out that
 								these figures are for unmanned rockets but since the Shuttle is a manned
 								vehicle "the probability of mission success is necessarily very close to
 .0." It is not very clear what this phrase means. Does it mean it is
 								close to 1 or that it ought to be close to 1? They go on to explain
 								"Historically this extremely high degree of mission success has given
 								rise to a difference in philosophy between manned space flight programs
 								and unmanned programs; i.e., numerical probability usage versus
 								engineering judgment." (These quotations are from "Space Shuttle Data
 								for Planetary Mission RTG Safety Analysis," Pages 3-1, 3-2, February 15,
 , NASA, JSC.) It is true that if the probability of failure was as
 								low as 1 in 100,000 it would take an inordinate number of tests to
 								determine it ( you would get nothing but a string of perfect flights
 								from which no precise figure, other than that the probability is likely
 								less than the number of such flights in the string so far). But, if the
 								real probability is not so small, flights would show troubles, near
 								failures, and possible actual failures with a reasonable number of
 								trials. and standard statistical methods could give a reasonable
 								estimate. In fact, previous NASA experience had shown, on occasion, just
 								such difficulties, near accidents, and accidents, all giving warning
 								that the probability of flight failure was not so very small. The
 								inconsistency of the argument not to determine reliability through
 								historical experience, as the range safety officer did, is that NASA
 								also appeals to history, beginning "Historically this high degree of
 								mission success..." Finally, if we are to replace standard numerical
 								probability usage with engineering judgment, why do we find such an
 								enormous disparity between the management estimate and the judgment of
 								the engineers? It would appear that, for whatever purpose, be it for
 								internal or external consumption, the management of NASA exaggerates the
 								reliability of its product, to the point of fantasy.
 								The history of the certification and Flight Readiness Reviews will not
 								be repeated here. (See other part of Commission reports.) The phenomenon
 								of accepting for flight, seals that had shown erosion and blow-by in
 								previous flights, is very clear. The Challenger flight is an excellent
 								example. There are several references to flights that had gone before.
 								The acceptance and success of these flights is taken as evidence of
 								safety. But erosion and blow-by are not what the design expected. They
 								are warnings that something is wrong. The equipment is not operating as
 								expected, and therefore there is a danger that it can operate with even
 								wider deviations in this unexpected and not thoroughly understood way.
 								The fact that this danger did not lead to a catastrophe before is no
 								guarantee that it will not the next time, unless it is completely
 								understood. When playing Russian roulette the fact that the first shot
 								got off safely is little comfort for the next. The origin and
 								consequences of the erosion and blow-by were not understood. They did
 								not occur equally on all flights and all joints; sometimes more, and
 								sometimes less. Why not sometime, when whatever conditions determined it
 								were right, still more leading to catastrophe?
 								In spite of these variations from case to case, officials behaved as if
 								they understood it, giving apparently logical arguments to each other
 								often depending on the "success" of previous flights. For example. in
 								determining if flight 51-L was safe to fly in the face of ring erosion
 								in flight 51-C, it was noted that the erosion depth was only one-third
 								of the radius. It had been noted in an \[**F2**\] experiment cutting the
 								ring that cutting it as deep as one radius was necessary before the ring
 								failed. Instead of being very concerned that variations of poorly
 								understood conditions might reasonably create a deeper erosion this
 								time, it was asserted, there was "a safety factor of three." This is a
 								strange use of the engineer's term ,"safety factor." If a bridge is
 								built to withstand a certain load without the beams permanently
 								deforming, cracking, or breaking, it may be designed for the materials
 								used to actually stand up under three times the load. This "safety
 								factor" is to allow for uncertain excesses of load, or unknown extra
 								loads, or weaknesses in the material that might have unexpected flaws,
 								etc. If now the expected load comes on to the new bridge and a crack
 								appears in a beam, this is a failure of the design. There was no safety
 								factor at all; even though the bridge did not actually collapse because
 								the crack went only one-third of the way through the beam. The O-rings
 								of the Solid Rocket Boosters were not designed to erode. Erosion was a
 								clue that something was wrong. Erosion was not something from which
 								safety can be inferred.
 								There was no way, without full understanding, that one could have
 								confidence that conditions the next time might not produce erosion three
 								times more severe than the time before. Nevertheless, officials fooled
 								themselves into thinking they had such understanding and confidence, in
 								spite of the peculiar variations from case to case. A mathematical model
 								was made to calculate erosion. This was a model based not on physical
 								understanding but on empirical curve fitting. To be more detailed, it
 								was supposed a stream of hot gas impinged on the O-ring material, and
 								the heat was determined at the point of stagnation (so far, with
 								reasonable physical, thermodynamic laws). But to determine how much
 								rubber eroded it was assumed this depended only on this heat by a
 								formula suggested by data on a similar material. A logarithmic plot
 								suggested a straight line, so it was supposed that the erosion varied as
 								the .58 power of the heat, the .58 being determined by a nearest fit. At
 								any rate, adjusting some other numbers, it was determined that the model
 								agreed with the erosion (to depth of one-third the radius of the ring).
 								There is nothing much so wrong with this as believing the answer\!
 								Uncertainties appear everywhere. How strong the gas stream might be was
 								unpredictable, it depended on holes formed in the putty. Blow-by showed
 								that the ring might fail even though not, or only partially eroded
 								through. The empirical formula was known to be uncertain, for it did not
 								go directly through the very data points by which it was determined.
 								There were a cloud of points some twice above, and some twice below the
 								fitted curve, so erosions twice predicted were reasonable from that
 								cause alone. Similar uncertainties surrounded the other constants in the
 								formula, etc., etc. When using a mathematical model careful attention
 								must be given to uncertainties in the model.
 								** **
 								**Liquid Fuel Engine (SSME)**
 								During the flight of 51-L the three Space Shuttle Main Engines all
 								worked perfectly, even, at the last moment, beginning to shut down the
 								engines as the fuel supply began to fail. The question arises, however,
 								as to whether, had it failed, and we were to investigate it in as much
 								detail as we did the Solid Rocket Booster, we would find a similar lack
 								of attention to faults and a deteriorating reliability. In other words,
 								were the organization weaknesses that contributed to the accident
 								confined to the Solid Rocket Booster sector or were they a more general
 								characteristic of NASA? To that end the Space Shuttle Main Engines and
 								the avionics were both investigated. No similar study of the Orbiter, or
 								the External Tank were made.
 								The engine is a much more complicated structure than the Solid Rocket
 								Booster, and a great deal more detailed engineering goes into it.
 								Generally, the engineering seems to be of high quality and apparently
 								considerable attention is paid to deficiencies and faults found in
 								operation.
 								The usual way that such engines are designed (for military or civilian
 								aircraft) may be called the component system, or bottom-up design. First
 								it is necessary to thoroughly understand the properties and limitations
 								of the materials to be used (for turbine blades, for example), and tests
 								are begun in experimental rigs to determine those. With this knowledge
 								larger component parts (such as bearings) are designed and tested
 								individually. As deficiencies and design errors are noted they are
 								corrected and verified with further testing. Since one tests only parts
 								at a time these tests and modifications are not overly expensive.
 								Finally one works up to the final design of the entire engine, to the
 								necessary specifications. There is a good chance, by this time that the
 								engine will generally succeed, or that any failures are easily isolated
 								and analyzed because the failure modes, limitations of materials, etc.,
 								are so well understood. There is a very good chance that the
 								modifications to the engine to get around the final difficulties are not
 								very hard to make, for most of the serious problems have already been
 								discovered and dealt with in the earlier, less expensive, stages of the
 								process.
 								The Space Shuttle Main Engine was handled in a different manner, top
 								down, we might say. The engine was designed and put together all at once
 								with relatively little detailed preliminary study of the material and
 								components. Then when troubles are found in the bearings, turbine
 								blades, coolant pipes, etc., it is more expensive and difficult to
 								discover the causes and make changes. For example, cracks have been
 								found in the turbine blades of the high pressure oxygen turbopump. Are
 								they caused by flaws in the material, the effect of the oxygen
 								atmosphere on the properties of the material, the thermal stresses of
 								startup or shutdown, the vibration and stresses of steady running, or
 								mainly at some resonance at certain speeds, etc.? How long can we run
 								from crack initiation to crack failure, and how does this depend on
 								power level? Using the completed engine as a test bed to resolve such
 								questions is extremely expensive. One does not wish to lose an entire
 								engine in order to find out where and how failure occurs. Yet, an
 								accurate knowledge of this information is essential to acquire a
 								confidence in the engine reliability in use. Without detailed
 								understanding, confidence can not be attained.
 								A further disadvantage of the top-down method is that, if an
 								understanding of a fault is obtained, a simple fix, such as a new shape
 								for the turbine housing, may be impossible to implement without a
 								redesign of the entire engine.
 								The Space Shuttle Main Engine is a very remarkable machine. It has a
 								greater ratio of thrust to weight than any previous engine. It is built
 								at the edge of, or outside of, previous engineering experience.
 								Therefore, as expected, many different kinds of flaws and difficulties
 								have turned up. Because, unfortunately, it was built in the top-down
 								manner, they are difficult to find and fix. The design aim of a lifetime
 								of 55 missions equivalent firings (27,000 seconds of operation, either
 								in a mission of 500 seconds, or on a test stand) has not been obtained.
 								The engine now requires very frequent maintenance and replacement of
 								important parts, such as turbopumps, bearings, sheet metal housings,
 								etc. The high-pressure fuel turbopump had to be replaced every three or
 								four mission equivalents (although that may have been fixed, now) and
 								the high pressure oxygen turbopump every five or six. This is at most
 								ten percent of the original specification. But our main concern here is
 								the determination of reliability.
 								In a total of about 250,000 seconds of operation, the engines have
 								failed seriously perhaps 16 times. Engineering pays close attention to
 								these failings and tries to remedy them as quickly as possible. This it
 								does by test studies on special rigs experimentally designed for the
 								flaws in question, by careful inspection of the engine for suggestive
 								clues (like cracks), and by considerable study and analysis. In this
 								way, in spite of the difficulties of top-down design, through hard work,
 								many of the problems have apparently been solved.
 								\[**F3**\] A list of some of the problems follows. Those followed by an
 								asterisk (\*) are probably solved:
 								  - Turbine blade cracks in high pressure fuel turbopumps (HPFTP). (May
 								    have been solved.)
 								  - Turbine blade cracks in high pressure oxygen turbopumps (HPOTP).
 								  - Augmented Spark Igniter (ASI) line rupture.\*
 								  - Purge check valve failure.\*
 								  - ASI chamber erosion.\*
 								  - HPFTP turbine sheet metal cracking.
 								  - HPFTP coolant liner failure.\*
 								  - Main combustion chamber outlet elbow failure.\*
 								  - Main combustion chamber inlet elbow weld offset.\*
 								  - HPOTP subsynchronous whirl.\*
 								  - Flight acceleration safety cutoff system (partial failure in a
 								    redundant system).\*
 								  - Bearing spalling (partially solved).
 								  - A vibration at 4,000 Hertz making some engines inoperable, etc.
 								Many of these solved problems are the early difficulties of a new
 								design, for 13 of them occurred in the first 125,000 seconds and only
 								three in the second 125,000 seconds. Naturally, one can never be sure
 								that all the bugs are out, and, for some, the fix may not have addressed
 								the true cause. Thus, it is not unreasonable to guess there may be at
 								least one surprise in the next 250,000 seconds, a probability of 1/500
 								per engine per mission. On a mission there are three engines, but some
 								accidents would possibly be contained, and only affect one engine. The
 								system can abort with only two engines. Therefore let us say that the
 								unknown suprises do not, even of themselves, permit us to guess that the
 								probability of mission failure do to the Space Shuttle Main Engine is
 								less than 1/500. To this we must add the chance of failure from known,
 								but as yet unsolved, problems (those without the asterisk in the list
 								above). These we discuss below. (Engineers at Rocketdyne, the
 								manufacturer, estimate the total probability as 1/10,000. Engineers at
 								marshal estimate it as 1/300, while NASA management, to whom these
 								engineers report, claims it is 1/100,000. An independent engineer
 								consulting for NASA thought 1 or 2 per 100 a reasonable estimate.)
 								The history of the certification principles for these engines is
 								confusing and difficult to explain. Initially the rule seems to have
 								been that two sample engines must each have had twice the time operating
 								without failure as the operating time of the engine to be certified
 								(rule of 2x). At least that is the FAA practice, and NASA seems to have
 								adopted it, originally expecting the certified time to be 10 missions
 								(hence 20 missions for each sample). Obviously the best engines to use
 								for comparison would be those of greatest total (flight plus test)
 								operating time -- the so-called "fleet leaders." But what if a third
 								sample and several others fail in a short time? Surely we will not be
 								safe because two were unusual in lasting longer. The short time might be
 								more representative of the real possibilities, and in the spirit of the
 								safety factor of 2, we should only operate at half the time of the
 								short-lived samples.
 								The slow shift toward decreasing safety factor can be seen in many
 								examples. We take that of the HPFTP turbine blades. First of all the
 								idea of testing an entire engine was abandoned. Each engine number has
 								had many important parts (like the turbopumps themselves) replaced at
 								frequent intervals, so that the rule must be shifted from engines to
 								components. We accept an HPFTP for a certification time if two samples
 								have each run successfully for twice that time (and of course, as a
 								practical matter, no longer insisting that this time be as large as 10
 								missions). But what is "successfully?" The FAA calls a turbine blade
 								crack a failure, in order, in practice, to really provide a safety
 								factor greater than 2. There is some time that an engine can run between
 								the time a crack originally starts until the time it has grown large
 								enough to fracture. (The FAA is contemplating new rules that take this
 								extra safety time into account, but only if it is very carefully
 								analyzed through known models within a known range of experience and
 								with materials thoroughly tested. None of these conditions apply to the
 								Space Shuttle Main Engine.
 								Cracks were found in many second stage HPFTP turbine blades. In one case
 								three were found after 1,900 seconds, while in another they were not
 								found after 4,200 seconds, although usually these longer runs showed
 								cracks. To follow this story further we shall have to realize that the
 								stress depends a great deal on the power level. The Challenger flight
 								was to be at, and previous flights had been at, a power level called
 % of rated power level during most of the time the engines were
 								operating. Judging from some material data it is supposed that at the
 								level 104% of rated power level, the time to crack is about twice that
 								at 109% or full power level (FPL). Future flights were to be at this
 								level because of heavier payloads, and many tests were made at this
 								level. Therefore dividing time at 104% by 2, we obtain units called
 								equivalent full power level (EFPL). (Obviously, some uncertainty is
 								introduced by that, but it has not been studied.) The earliest cracks
 								mentioned above occurred at 1,375 EFPL.
 								Now the certification rule becomes "limit all second stage blades to a
 								maximum of 1,375 seconds EFPL." If one objects that the safety factor of
 is lost it is pointed out that the one turbine ran for 3,800 seconds
 								EFPL without cracks, and half of this is 1,900 so we are being more
 								conservative. We have fooled ourselves in three ways. First we have only
 								one sample, and it is not the fleet leader, for the other two samples of
 ,800 or more seconds had 17 cracked blades between them. (There are 59
 								blades in the engine.) Next we have abandoned the 2x rule and
 								substituted equal time. And finally, 1,375 is where we did see a crack.
 								We can say that no crack had been found below 1,375, but the last time
 								we looked and saw no cracks was 1,100 seconds EFPL. We do not know when
 								the crack formed between these times, for example cracks may have formed
 								at 1,150 seconds EFPL. (Approximately 2/3 of the blade sets tested in
 								excess of 1,375 seconds EFPL had cracks. Some recent experiments have,
 								indeed, shown cracks as early as 1,150 seconds.) It was important to
 								keep the number high, for the Challenger was to fly an engine very close
 								to the limit by the time the flight was over.
 								Finally it is claimed that the criteria are not abandoned, and the
 								system is safe, by giving up the FAA convention that there should be no
 								cracks, and considering only a completely fractured blade a failure.
 								With this definition no engine has yet failed. The idea is that since
 								there is sufficient time for a crack to grow to a fracture we can insure
 								that all is safe by inspecting all blades for cracks. If they are found,
 								replace them, and if none are found we have enough time for a safe
 								mission. This makes the crack problem not a flight safety problem, but
 								merely a maintenance problem.
 								This may in fact be true. But how well do we know that cracks always
 								grow slowly enough that no fracture can occur in a mission? Three
 								engines have run for long times with a few cracked blades (about 3,000
 								seconds EFPL) with no blades broken off.
 								But a fix for this cracking may have been found. By changing the blade
 								shape, shot-peening the surface, and covering with insulation to exclude
 								thermal shock, the blades have not cracked so far.
 								A very similar story appears in the history of certification of the
 								HPOTP, but we shall not give the details here.
 								It is evident, in summary, that the Flight Readiness Reviews and
 								certification rules show a deterioration for some of the problems of the
 								Space Shuttle Main Engine that is closely analogous to the deterioration
 								seen in the rules for the Solid Rocket Booster.
 								** **
 								**Avionics**
 								By "avionics" is meant the computer system on the Orbiter as well as its
 								input sensors and output actuators. At first we will restrict ourselves
 								to the computers proper and not be concerned with the reliability of the
 								input information from the sensors of \[**F4**\] temperature, pressure,
 								etc., nor with whether the computer output is faithfully followed by the
 								actuators of rocket firings, mechanical controls, displays to
 								astronauts, etc.
 								The computer system is very elaborate, having over 250,000 lines of
 								code. It is responsible, among many other things, for the automatic
 								control of the entire ascent to orbit, and for the descent until well
 								into the atmosphere (below Mach 1) once one button is pushed deciding
 								the landing site desired. It would be possible to make the entire
 								landing automatically (except that the landing gear lowering signal is
 								expressly left out of computer control, and must be provided by the
 								pilot, ostensibly for safety reasons) but such an entirely automatic
 								landing is probably not as safe as a pilot controlled landing. During
 								orbital flight it is used in the control of payloads, in displaying
 								information to the astronauts, and the exchange of information to the
 								ground. It is evident that the safety of flight requires guaranteed
 								accuracy of this elaborate system of computer hardware and software.
 								In brief, the hardware reliability is ensured by having four essentially
 								independent identical computer systems. Where possible each sensor also
 								has multiple copies, usually four, and each copy feeds all four of the
 								computer lines. If the inputs from the sensors disagree, depending on
 								circumstances, certain averages, or a majority selection is used as the
 								effective input. The algorithm used by each of the four computers is
 								exactly the same, so their inputs (since each sees all copies of the
 								sensors) are the same. Therefore at each step the results in each
 								computer should be identical. From time to time they are compared, but
 								because they might operate at slightly different speeds a system of
 								stopping and waiting at specific times is instituted before each
 								comparison is made. If one of the computers disagrees, or is too late in
 								having its answer ready, the three which do agree are assumed to be
 								correct and the errant computer is taken completely out of the system.
 								If, now, another computer fails, as judged by the agreement of the other
 								two, it is taken out of the system, the rest of the flight is canceled,
 								and descent to the landing site is instituted, controlled by the two
 								remaining computers. It is seen that this is a redundant system since
 								the failure of only one computer does not affect the mission. Finally,
 								as an extra feature of safety, there is a fifth independent computer,
 								whose memory is loaded with only the programs of ascent and descent, and
 								which is capable of controlling the descent if there is a failure of
 								more than two of the computers of the main line four.
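
The comparison-and-removal scheme described above can be illustrated with a minimal Python sketch. It is not the Shuttle's actual software: the class name `RedundantSet`, the computer labels `C1` to `C4`, the mode names, and the rule that a missing strict majority hands control to the backup computer are all assumptions made for illustration. A late answer is represented as `None`.

```python
from collections import Counter


class RedundantSet:
    """Toy model of four identical computers compared at sync points."""

    def __init__(self, names=("C1", "C2", "C3", "C4")):
        # Hypothetical labels; every computer starts in the active set.
        self.active = list(names)

    def compare(self, outputs):
        """Compare results at one synchronization point.

        outputs maps computer name -> result, with None meaning the
        answer was not ready in time. Returns the agreed value, or
        None if no strict majority of the active computers agrees.
        """
        answers = {name: outputs.get(name) for name in self.active}
        ready = [value for value in answers.values() if value is not None]
        if ready:
            value, count = Counter(ready).most_common(1)[0]
            if count > len(self.active) / 2:
                # Drop every computer that disagreed or was late.
                self.active = [n for n in self.active if answers[n] == value]
                return value
        # Assumption: without a majority the main set cannot be trusted.
        self.active = []
        return None

    def mode(self):
        """Flight consequence of the number of computers still active."""
        if len(self.active) >= 3:
            return "nominal"        # one failure does not affect the mission
        if len(self.active) == 2:
            return "abort_descent"  # rest of flight canceled, two remain
        return "backup"             # fifth, independent computer takes over


rs = RedundantSet()
print(rs.compare({"C1": 7, "C2": 7, "C3": 9, "C4": 7}), rs.active, rs.mode())
print(rs.compare({"C1": 5, "C2": None, "C4": 5}), rs.active, rs.mode())
```

In this sketch the first comparison removes the dissenting computer and the mission continues; the second removes a late one, leaving two computers and forcing a descent. A further disagreement between the last two would leave no majority, which the sketch treats as the case for the fifth computer.
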
 								There is not enough room in the memory of the main line computers for
 								all the ascent, descent, and in-flight payload programs, so
 								the memory is loaded about four times from tapes by the astronauts.
 								Because of the enormous effort required to replace the software for such
 								an elaborate system, and for checking a new system out, no change has
 								been made to the hardware since the system began about fifteen years
 								ago. The actual hardware is obsolete; for example, the memories are of
 								the old ferrite core type. It is becoming more difficult to find
 								manufacturers to supply such old-fashioned computers reliably and of
 								high quality. Modern computers are very much more reliable, can run much
 								faster, simplifying circuits, and allowing more to be done, and would
 								not require so much loading of memory, for the memories are much larger.
 								The software is checked very carefully in a bottom-up fashion. First,
 								each new line of code is checked, then sections of code or modules with
 								special functions are verified. The scope is increased step by step
 								until the new changes are incorporated into a complete system and
 								checked. This complete output is considered the final product, newly
 								released. But, completely independently, there is a separate
 								verification group that takes an adversarial attitude toward the software
 								development group, and tests and verifies the software as if it were a
 								customer of the delivered product. There is additional verification in
 								using the new programs in simulators, etc. A discovery of an error
 								during verification testing is considered very serious, and its origin
 								studied very carefully to avoid such mistakes in the future. Such
 								unexpected errors have been found only about six times in all the
 								programming and program changing (for new or altered payloads) that has
 								been done. The principle that is followed is that all the verification
 								is not an aspect of program safety; it is merely a test of that safety,
 								in a non-catastrophic verification. Flight safety is to be judged solely
 								on how well the programs do in the verification tests. A failure here
 								generates considerable concern.
 								To summarize then, the computer software checking system and attitude are
 								of the highest quality. There appears to be no process of gradually
 								fooling oneself while degrading standards, so characteristic of the Solid
 								Rocket Booster or Space Shuttle Main Engine safety systems. To be sure,
 								there have been recent suggestions by management to curtail such
 								elaborate and expensive tests as being unnecessary at this late date in
 								Shuttle history. This must be resisted, for it does not appreciate the
 								mutual subtle influences and sources of error generated by even small
 								changes of one part of a program on another. There are perpetual
 								requests for changes as new payloads and new demands and modifications
 								are suggested by the users. Changes are expensive because they require
 								extensive testing. The proper way to save money is to curtail the number
 								of requested changes, not the quality of testing for each.
 								One might add that the elaborate system could be very much improved by
 								more modern hardware and programming techniques. Any outside competition
 								would have all the advantages of starting over, and whether that is a
 								good idea for NASA now should be carefully considered.
 								Finally, returning to the sensors and actuators of the avionics system,
 								we find that the attitude to system failure and reliability is not
 								nearly as good as for the computer system. For example, a difficulty was
 								found with certain temperature sensors sometimes failing. Yet 18 months
 								later the same sensors were still being used, still sometimes failing,
 								until a launch had to be scrubbed because two of them failed at the same
 								time. Even on a succeeding flight this unreliable sensor was used again.
 								Again, the reaction control systems, the rocket jets used for reorienting and
 								control in flight, are still somewhat unreliable. There is considerable
 								redundancy, but a long history of failures, none of which has yet been
 								extensive enough to seriously affect flight. The action of the jets is
 								checked by sensors, and, if they fail to fire, the computers choose
 								another jet to fire. But they are not designed to fail, and the problem
 								should be solved.
 								**Conclusions**
 								If a reasonable launch schedule is to be maintained, engineering often
 								cannot be done fast enough to keep up with the expectations of
 								originally conservative certification criteria designed to guarantee a
 								very safe vehicle. In these situations, subtly, and often with
 								apparently logical arguments, the criteria are altered so that flights
 								may still be certified in time. They therefore fly in a relatively
 								unsafe condition, with a chance of failure of the order of a percent (it
 								is difficult to be more accurate).
 								Official management, on the other hand, claims to believe the
 								probability of failure is a thousand times less. One reason for this may
 								be an attempt to assure the government of NASA perfection and success in
 								order to ensure the supply of funds. The other may be that they
 								sincerely believed it to be true, demonstrating an almost incredible
 								lack of communication between themselves and their working engineers.
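
The gap between the two estimates can be made concrete with a little arithmetic. The Python sketch below assumes that every flight is independent and carries the same failure probability, a simplification that the report itself does not make; the function names and the 25-flight horizon are likewise chosen only for illustration. The figures of 1 in 100, 1 in 100,000, and one launch a day for 300 years are the ones quoted in the report's introduction.

```python
def prob_at_least_one_loss(p_per_flight: float, flights: int) -> float:
    """Chance of losing at least one vehicle in a run of flights,
    assuming independent flights with identical failure probability."""
    return 1.0 - (1.0 - p_per_flight) ** flights


def expected_losses(p_per_flight: float, flights: int) -> float:
    """Expected number of vehicles lost over a run of flights."""
    return p_per_flight * flights


ENGINEER_ESTIMATE = 1 / 100        # "of the order of a percent"
MANAGEMENT_ESTIMATE = 1 / 100_000  # "a thousand times less"

flights = 25  # illustrative horizon, not a figure from the report
for label, p in (("engineers", ENGINEER_ESTIMATE),
                 ("management", MANAGEMENT_ESTIMATE)):
    risk = prob_at_least_one_loss(p, flights)
    print(f"{label}: {risk:.4%} chance of at least one loss in {flights} flights")

# One launch per day for 300 years at the management figure.
daily_flights = 300 * 365
print(f"expected losses: {expected_losses(MANAGEMENT_ESTIMATE, daily_flights):.3f}")
```

Under these assumptions the engineers' figure implies roughly a 22 percent chance of at least one loss in 25 flights, while the management figure implies about 0.025 percent. The last line reproduces the introduction's remark that a daily launch for 300 years would be expected to lose only about one vehicle.
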
 								In any event this has had very unfortunate consequences, the most
 								serious of which is to encourage ordinary citizens to fly in such a
 								dangerous machine, as if it had attained the safety of an ordinary
 								airliner. The astronauts, like test pilots, should know their risks, and
 								we honor them for their courage. Who can doubt that McAuliffe was
 								equally a person of great courage, who was closer to an awareness of the
 								true risk than NASA management would have us believe?
 								\[**F5**\] Let us make recommendations to ensure that NASA officials
 								deal in a world of reality in understanding technological weaknesses and
 								imperfections well enough to be actively trying to eliminate them. They
 								must live in reality in comparing the costs and utility of the Shuttle
 								to other methods of entering space. And they must be realistic in making
 								contracts, in estimating costs, and the difficulty of the projects. Only
 								realistic flight schedules should be proposed, schedules that have a
 								reasonable chance of being met. If in this way the government would not
 								support them, then so be it. NASA owes it to the citizens from whom it
 								asks support to be frank, honest, and informative, so that these
 								citizens can make the wisest decisions for the use of their limited
 								resources.
 								For a successful technology, reality must take precedence over public
 								relations, for nature cannot be fooled.