MTBF: A MEASURE OF OEM DISK DRIVE RELIABILITY


IBM STORAGE SYSTEMS DIVISION

INTRODUCTION

As a supplier of small-form-factor disk drives to original equipment
manufacturers (OEMs), IBM's Storage Systems Division (SSD) is committed
to delivering products that lead the industry in quality and
reliability.

Efforts to maximize drive reliability are at the forefront of the
company's design, development, and manufacturing processes.  Progress
in this quest is assessed through the use of a reliability measure
known as mean time between failure (MTBF).

Because MTBF is widely used by suppliers of small-form-factor disk drives
and is regarded throughout the industry as an effective gauge of
reliability, it is important that customers be able to interpret the
relationship between MTBF claims and customer expectations for drive
reliability in their own applications.  In general, higher MTBF
correlates with fewer drive failures; but an MTBF claim is not a
guarantee of product reliability and does not represent a condition
of warranty.

The purpose of this paper is to help clarify the meaning of MTBF by
addressing some frequently asked questions regarding it.

QUESTIONS AND ANSWERS

1.  What is MTBF?

    MTBF is the mean of a distribution of product lifetimes, often
    estimated by dividing the total operating time accumulated by
    a "defined group" of drives within a given time period by the
    total number of failures in that time period.
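
    The estimate described above can be sketched in a few lines of
    Python.  The function name and the sample figures (2,000 drives,
    5,000 hours each, 10 failures) are hypothetical and chosen only
    for illustration; they are not IBM data.

```python
def estimate_mtbf(total_operating_hours: float, failures: int) -> float:
    """Point estimate of MTBF: accumulated operating hours per failure."""
    if failures <= 0:
        raise ValueError("at least one failure is needed for a point estimate")
    return total_operating_hours / failures


# Hypothetical observation: 2,000 drives each run 5,000 hours, 10 failures seen.
hours = 2_000 * 5_000
print(estimate_mtbf(hours, 10))  # 1000000.0
```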

2.  What is this "defined group" of drives?

    This is a group of drives that:

    -  have not reached end-of-life (typically five to seven years)
    -  are operated within a specified reliability temperature range,
       under normal usage conditions, and
    -  have not been damaged or abused.

3.  What is considered to be a failure?

    Any event that prevents a drive from performing its specified
    operations, given that the drive meets the group definition
    described in question 2.

    This includes drives that fail during shipment and during what is
    frequently referred to as the "early life period" (failures
    typically resulting from manufacturing defects).

    It does not include drives that fail during integration into OEM
    system units or as a result of mishandling, nor does it encompass
    drives that fail beyond end-of-life.

4.  If I purchase a drive with an MTBF of 1,000,000 hours (114 years),
    can I expect the drive to operate without failure for 1,000,000
    hours?

    No, because the drive will reach end-of-life before reaching
    1,000,000 hours.  For example, a continuously operated drive with
    a five-year useful life will reach end-of-life in less than
    45,000 hours.  But, theoretically, if the drive is replaced with
    a new drive when it reaches end-of-life, and each replacement
    is in turn replaced when it reaches end-of-life, and so on, then
    the probability that 1,000,000 hours would elapse before a
    failure occurs would be greater than 30 percent in most cases.
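
    One way to see where a figure above 30 percent can come from is
    the sketch below.  It assumes a constant failure rate of 1/MTBF
    (an exponential lifetime model); this paper does not state which
    model lies behind its figure, so treat the result as illustrative
    only.  The names used are hypothetical.

```python
import math

HOURS_PER_YEAR = 8_760  # 365 days x 24 hours


def prob_no_failure(hours: float, mtbf_hours: float) -> float:
    """P(no failure in `hours`), assuming a constant failure rate of 1/MTBF."""
    return math.exp(-hours / mtbf_hours)


# A continuously operated drive reaches a five-year end-of-life at:
print(5 * HOURS_PER_YEAR)  # 43800

# Chance that 1,000,000 hours pass with no failure at 1,000,000 hours MTBF:
print(round(prob_no_failure(1_000_000, 1_000_000), 3))  # 0.368
```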

5.  If I purchase 1000 drives with an MTBF of 1,000,000 hours, how
    many can I expect to fail over a five-year period?

    Assuming that any failed drive is replaced with a new drive having
    the same reliability characteristics and that the drives are used
    continuously, then the number of failures, r, you can expect is
    approximately:

                             (1000 drives) x (43,800 hours/drive)
        r (approximately) =  ___________________________________
                             1,000,000 hours/failure

    Therefore r approximately equals 44 failures

    Note that this number is subject to statistical variation (1).

    If the drives are operated for 16 hours per day instead of 24 hours
    per day, then the number of failures you can expect is:

                             (1000 drives) x (29,200 hours/drive)
       r (approximately) =   ___________________________________
                             1,000,000 hours/failure

    Therefore r approximately equals 29 failures
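
    The two calculations above follow the same pattern, which can be
    written as a small function.  This is a minimal sketch; the
    function and parameter names are hypothetical, and a year is taken
    as 365 days to match the hour counts used in the text.

```python
def expected_failures(drives: int, hours_per_day: float, years: float,
                      mtbf_hours: float) -> float:
    """Expected failures when each failed drive is replaced by a like drive."""
    operating_hours_per_drive = hours_per_day * 365 * years
    return drives * operating_hours_per_drive / mtbf_hours


print(round(expected_failures(1000, 24, 5, 1_000_000), 1))  # 43.8
print(round(expected_failures(1000, 16, 5, 1_000_000), 1))  # 29.2
```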

6.  IBM reports a "predicted MTBF."  What does this mean?

    It is very costly and time-consuming to actually measure high
    MTBFs with a reasonable degree of precision.  Therefore, to
    assess the reliability of a new disk drive prior to volume
    production, reliability data from past products and component
    and assembly tests are merged to create a mathematical model of
    the drive reliability characteristics.  The outcome of that
    modeling process is the "predicted MTBF."  After volume production
    gets under way, actual field failure data is used to check the
    validity of the model.

7.  If I buy drives that have a "predicted MTBF" of 1,000,000 hours,
    can I expect to achieve 1,000,000 hours MTBF from those drives?

    Yes, given the conditions stated in question 2.  The actual MTBF
    measured from any specific set of drives will depend on the usage
    and the environmental conditions the drives experience.

    Stressing a drive beyond normal usage conditions may reduce the
    actual MTBF to a point below the "predicted MTBF."  Generally,
    reliability decreases as temperature increases, so drives that
    are operated in warm environments with poor airflow will tend to
    have a lower MTBF than those operated in cool environments with
    good airflow.  Drives that experience a high seek rate tend to
    have a somewhat lower reliability than those that experience a
    low seek rate.  And drives that are in portable equipment tend
    to be subject to higher levels of shock and vibration, which
    also degrade reliability.

    Furthermore, because MTBF can only be measured using statistical
    methods, any measurement will be subject to statistical
    variation.  The degree of variation depends on the number of drives
    included in the measurement.  With more devices, less variation can
    be expected.
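
    The effect of group size can be illustrated with a rough sketch.
    It assumes that failures arrive at a constant rate, so that the
    failure count is approximately Poisson distributed and its
    standard deviation is the square root of its mean.  That model and
    the function name are assumptions made for illustration.

```python
import math


def relative_spread(drives: int, hours_per_drive: float,
                    mtbf_hours: float) -> float:
    """Approximate relative standard deviation of the observed failure count."""
    expected = drives * hours_per_drive / mtbf_hours
    return 1 / math.sqrt(expected)


# Five years of continuous use (43,800 hours) at 1,000,000 hours MTBF.
for n in (100, 1_000, 10_000):
    print(n, round(relative_spread(n, 43_800, 1_000_000), 2))
```

    Under these assumptions the relative spread shrinks with the
    square root of the number of drives: ten times as many drives
    gives roughly one third of the relative variation.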

8.  I have seen the reliability of drives characterized by the "CDF".
    What is CDF?

    CDF is an acronym for "cumulative distribution function."  It is
    a mathematical function that defines the probability that a drive
    will fail prior to some point in time.  For example, a drive with
    a CDF equal to four percent at five years has a four percent
    chance of failing sometime within its first five years of operation.

    CDF can also be used to determine the number of expected failures
    from a group of drives.  For example, say that 1000 drives are put
    into service simultaneously.  If the CDF equals four percent at
    five years, then four percent, or 40, of the drives can, on
    average, be expected to fail within five years.  It should be noted
    that if, when a drive fails, it is replaced with a new drive, the
    total number of failures over the five-year period will, on average,
    be higher than 40 since some of the replacement drives may also fail.
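
    The difference between the two counts can be sketched as follows.
    The replacement case assumes a constant failure rate, which this
    paper does not state; under a different lifetime model the second
    number would differ.  Function names are hypothetical.

```python
import math


def expected_failures_no_replacement(drives: int, cdf: float) -> float:
    """Expected failures among the original drives only."""
    return drives * cdf


def expected_failures_with_replacement(drives: int, cdf: float) -> float:
    """Expected failures when every failed drive is replaced at once.

    Assumes a constant failure rate, so cdf = 1 - exp(-rate * t) and the
    expected count per drive slot is rate * t = -ln(1 - cdf).
    """
    return drives * -math.log(1 - cdf)


print(expected_failures_no_replacement(1000, 0.04))              # 40.0
print(round(expected_failures_with_replacement(1000, 0.04), 1))  # 40.8
```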

9.  Can I compare the predicted MTBFs reported by IBM, with MTBF claims
    by other drive vendors?

    Yes, given that the assumptions behind the claims are the same.
    Because there is no established industry standard for calculating
    or reporting MTBF, other vendors may not include early life failures,
    and/or may not specify the same end-of-life.  In general, differences
    such as these will affect the MTBF claim.

10. Does a predicted MTBF imply a warranty?

    No.  A predicted MTBF provides a reliability indicator for disk drives.
    It is not a guarantee of product reliability and does not represent a
    condition of warranty.  Contact your IBM sales representative for
    answers to warranty questions.

********************************************************************

(1) In this example, because of statistical variation, there is
    approximately a 90 percent probability that the actual number
    of failures will be between 33 and 55.
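
    A range of this kind can be checked with a short calculation.
    The sketch below assumes the failure count follows a Poisson
    distribution with a mean of 43.8, the unrounded result from
    question 5; the paper does not say which method was used to
    derive its range, and the function names are hypothetical.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k events for a Poisson variable with this mean."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def prob_between(lo: int, hi: int, mean: float) -> float:
    """Probability that the count falls in the inclusive range lo..hi."""
    return sum(poisson_pmf(k, mean) for k in range(lo, hi + 1))


# Probability of seeing between 33 and 55 failures when 43.8 are expected.
print(round(prob_between(33, 55, 43.8), 2))
```

    Under that assumption the result comes out close to 0.9, in line
    with the approximate figure quoted in the note.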