SPOF analysis, SysAdmin Style




Questions addressed in this document:

  • What is a SPOF analysis?
  • What should I include in a SPOF analysis?
  • What should I measure in a SPOF analysis?
  • What is MTBF?
  • What is MTTN?
  • What is MTTR?
  • What constitutes failure?
  • How can I use this information to define risk?
  • How can I determine the criticality of a component?



What is a SPOF analysis?

A Single Point of Failure (SPOF) analysis is a systematic analysis
of what can go wrong in your environment, and what impact each failure
would have. It details the interdependencies and relationships among
the major components in your environment.




What should I include in a SPOF analysis?

A SPOF analysis of an SA's environment should include the following, at a minimum:


  • A list of all network-attached components
  • A list of all services

This list is far from complete, however. It ignores standalone
(non-networked) devices, system components, and application software,
just to name a few. It also ignores environmental factors such as
power, environmental conditioning, and so forth. Ideally, a SPOF
analysis would encompass every component in an environment, from the
CPUs and DIMMs on the motherboard, to each cable in the walls, to the
pole- or ground-based power transformers external to your facility.






What should I measure in a SPOF analysis?

At a bare minimum, you should note, for each component on your list, which
other components will fail if it fails completely or is removed from
the environment entirely. A more thorough analysis would identify the
various possible failure modes of each component listed, and attempt to
relate failure modes of one component to failure modes of other components.
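
The bare-minimum measurement described above can be sketched as a walk over
a dependency map. This is only an illustration: the component names and the
DEPENDS_ON table are hypothetical, and the sketch assumes every dependency is
a hard one (no redundancy), so a component fails as soon as anything it
depends on fails.

```python
from collections import deque

# Hypothetical inventory: each component maps to the components it depends on.
DEPENDS_ON = {
    "core-switch": [],
    "nis-master": ["core-switch"],
    "nfs-server": ["core-switch"],
    "webserver": ["nfs-server", "nis-master"],
    "mail": ["nis-master"],
}


def affected_by(failed, depends_on):
    """Return every component that fails, directly or indirectly, when `failed` fails."""
    # Invert the map so we can look up "who depends on this component?".
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(name)

    # Breadth-first walk outward from the failed component.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in dependents[current]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    seen.discard(failed)  # guard against dependency cycles
    return seen


if __name__ == "__main__":
    for component in DEPENDS_ON:
        print(component, "->", sorted(affected_by(component, DEPENDS_ON)))
```

A component whose failure takes down many others is a candidate single point
of failure. Modelling redundancy (for example, a service that survives as long
as one of two servers is up) would need a richer structure than this simple map.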



You should also note the MTBF, MTTN, and MTTR for each failure.





What is MTBF?

Mean Time Between Failures (MTBF) is a measure of the time (typically in
hours) separating one failure of a component from the next. There are
actually two types of MTBF measurements: the specified MTBF for
a component, and its measured MTBF. The specified MTBF is
provided by the vendor, and is generally available for hardware components.
The measured MTBF is an average of the time between failures in your
own environment. It is a measure of failure frequency for a
component specific to your implementation.


Though the specified MTBF can be useful, particularly when the
component is old and the measured MTBF is unknown, the measured MTBF is
almost always preferred.
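
As a rough sketch, the measured MTBF can be derived from the failure
timestamps you have recorded for a component. The function name and the
sample dates below are hypothetical; the calculation simply averages the
gaps between consecutive failures, as described above.

```python
from datetime import datetime


def measured_mtbf_hours(failure_times):
    """Average hours between consecutive failures, or None with fewer than two failures."""
    times = sorted(failure_times)
    if len(times) < 2:
        return None  # not enough history to measure anything
    gaps = [
        (later - earlier).total_seconds() / 3600
        for earlier, later in zip(times, times[1:])
    ]
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    # Hypothetical failure log for one component: gaps of 10 days and 20 days.
    failures = [
        datetime(2024, 1, 1),
        datetime(2024, 1, 11),
        datetime(2024, 1, 31),
    ]
    print(measured_mtbf_hours(failures))
```

With only a handful of recorded failures the average is a coarse estimate,
which is one situation where the specified MTBF remains a useful fallback.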




What is MTTN?

Mean Time To Notification (MTTN) is a measure of the time that elapses between
the occurrence of a component failure, and notification of the appropriate
personnel to correct said failure. If you work to a specified SLA, or
your monitoring systems are inefficient, you may have a fairly high MTTN.
One of your goals should be to minimize MTTN where necessary (more on this
in the discussion on risk).




What is MTTR?

Mean Time to Recover (MTTR) is the measure of time it takes, on average,
to correct a failure in a given component.
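
Both MTTN and MTTR can be estimated from an incident log. The sketch below
uses a hypothetical log of (failed, notified, recovered) timestamps. It
assumes that recovery time is counted from notification to recovery; if you
prefer to count from the moment of failure, adjust the subtraction accordingly.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident log for one component: (failed_at, notified_at, recovered_at).
INCIDENTS = [
    (datetime(2024, 3, 1, 2, 0), datetime(2024, 3, 1, 2, 30), datetime(2024, 3, 1, 4, 30)),
    (datetime(2024, 3, 9, 14, 0), datetime(2024, 3, 9, 14, 30), datetime(2024, 3, 9, 15, 30)),
]


def _hours(delta):
    """Convert a timedelta to fractional hours."""
    return delta.total_seconds() / 3600


def mttn_hours(incidents):
    """Mean time from failure to notification, in hours."""
    return mean(_hours(notified - failed) for failed, notified, _ in incidents)


def mttr_hours(incidents):
    """Mean time from notification to recovery, in hours (an assumption, see above)."""
    return mean(_hours(recovered - notified) for _, notified, recovered in incidents)


if __name__ == "__main__":
    print("MTTN (h):", mttn_hours(INCIDENTS))
    print("MTTR (h):", mttr_hours(INCIDENTS))
```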









What constitutes failure?

An interesting question in this process is: What constitutes failure?
This may seem a trivial question at first, but the term 'failure' is
meaningless in the abstract. In whose opinion has a failure occurred?
Your users'? Your customers'? Your boss's? Yours? It is important to
understand and define the perspective from which a lack of functionality
will be deemed a failure.

If your webservers are colocated and your internal NIS slave dies, has
a failure occurred? You may think so. Your users may think so. Your
customers, on the other hand, may not notice or care. Therefore, you
may wish to define failure in terms of service availability, and analyze
each component based on service interdependencies.





How can I use this information to define risk?

Once you have settled on a definition of failure, you should consider
the level of importance you wish to attach to a given failure. This can
be done in a number of ways. Failure Mode, Effects, and Criticality
Analysis (FMECA, defined in Military Procedure MIL-P-1629) defines
risk as a product of severity, occurrence probability, and detection
ability.



Using the data collected in a SPOF analysis, you can assign each component
failure a risk value, using this mathematical scaling approach.
To do this, you need to compute values for severity, occurrence probability,
and detection ability.

  • Severity may be computed by counting the total number of components
    affected by a failure, and dividing this count by the total number of
    components, yielding a value between 0 and 1.
  • Occurrence probability must be defined as a function of time. You
    have the measured MTBF for each component. This value should be assessed
    within a timeframe. You may be interested in the likelihood that a failure
    will occur within a given week. You should therefore divide the number of
    hours in a week by the measured MTBF for that component, to give you a
    value. Any value above one would indicate you should expect a failure of
    that component in a week's time. You may be more interested in failures
    per business quarter, or even year. Adjust the numerator to reflect this,
    and again divide by the measured MTBF to obtain your value.
  • Detection ability may be scaled according to your minimum and
    maximum acceptable detection times. If you assign your minimum acceptable
    detection time (the fastest detection time) a value of 1, your maximum
    acceptable detection time a value of 10, and then subdivide the intervening
    time into 10 discrete ranges, you can assign values to the MTTN according
    to this scaling. For MTTN values below your minimum acceptable time, assign
    a value of 1. Similarly, for MTTN values above your maximum acceptable
    time, assign a value of 10.
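
The three values above might be computed along the following lines. This is
a sketch under stated assumptions: the function names are hypothetical, and
the detection scale reflects one reading of the text, in which the span
between the minimum and maximum acceptable detection times is split into ten
equal ranges scored 1 through 10.

```python
HOURS_PER_WEEK = 7 * 24


def severity(affected_count, total_components):
    """Fraction of all components affected by the failure (0 to 1)."""
    return affected_count / total_components


def occurrence(mtbf_hours, window_hours=HOURS_PER_WEEK):
    """Expected failures within the window; above 1 means expect a failure."""
    return window_hours / mtbf_hours


def detection(mttn, min_ok, max_ok):
    """Score MTTN from 1 (fast) to 10 (slow); all arguments share one time unit."""
    if mttn <= min_ok:
        return 1
    if mttn >= max_ok:
        return 10
    # Index of the equal-width range that MTTN falls into, shifted to start at 1.
    return 1 + int((mttn - min_ok) * 10 / (max_ok - min_ok))


if __name__ == "__main__":
    print(severity(2, 5))        # 2 of 5 components affected
    print(occurrence(336))       # MTBF of two weeks, weekly window
    print(detection(20, 5, 65))  # MTTN of 20 minutes on a 5-to-65-minute scale
```

To assess failures per quarter or per year instead of per week, pass a
different window_hours value to occurrence().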





You may now compute a risk value as a function of Severity, Occurrence, and
Detection:

r=S*O*D






How can I determine the criticality of a component?

Using the methods described above, you can compute criticality using the
formula:

C=S*O
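
To illustrate how the two formulas can rank components differently, here is
a small sketch. The component names and their (S, O, D) scores are invented
for the example; only the formulas r=S*O*D and C=S*O come from the text.

```python
# Hypothetical scores per component: (severity, occurrence, detection).
SCORES = {
    "core-switch": (0.8, 0.1, 2),
    "nis-master": (0.4, 0.5, 9),
    "webserver": (0.2, 1.5, 1),
}


def risk(s, o, d):
    """r = S * O * D"""
    return s * o * d


def criticality(s, o):
    """C = S * O"""
    return s * o


def rank(scores, key):
    """Return component names sorted from highest to lowest by `key(name)`."""
    return sorted(scores, key=key, reverse=True)


by_risk = rank(SCORES, lambda name: risk(*SCORES[name]))
by_criticality = rank(SCORES, lambda name: criticality(*SCORES[name][:2]))

if __name__ == "__main__":
    print("By risk:       ", by_risk)
    print("By criticality:", by_criticality)
```

In this made-up data the NIS master tops the risk ranking because its failures
are detected slowly, while the webserver tops the criticality ranking because
detection ability is not part of that formula.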