Benchmarking Storage Systems

White Paper
Ohad I. Unna
DynaTek Automation Systems Inc.
September 1992


Abstract

Existing disk benchmarking programs fall short in measuring the performance of the storage subsystem as a whole, missing the forest for the trees. In order to obtain a meaningful performance index of a storage subsystem, one that is congruent with the user-perceived usefulness of that system, a benchmark has to measure performance at the application level, simulating real-life computing environments. This paper presents a method for constructing such a benchmark.


Foreword, or: Who needs another disk benchmarking program?

It is said that if you have a watch you know what time it is, but if you have two, you are never sure. If you have half a dozen disk benchmarking utilities, you probably have six different sets of results, and there is a good chance none of them bears any resemblance to the way your system performs under its normal workload.

Back in the old days (1982-83) things were pretty simple: PC hard disk drives invariably spun at 3600 revolutions per minute, almost always had 17 sectors per track, and built-in bad-sector remapping and multi-megabyte caches were practically unheard of in the microcomputer world. When the IDE and SCSI disk interfaces were introduced to the PC arena, things got really complicated: disk drive vendors started employing sector-mapping schemes to avoid the 1024-cylinder restriction, controllers were enhanced with large onboard caches, and on-the-fly disk compression adapters and software became more and more popular, rendering most of the old disk benchmarking software useless.

If you are not familiar with terms such as "sector-mapping" and "compression adapters," don't worry. As a user, you shouldn't really care whether your drive has "skewed-track interleave" or "low BIOS overhead latency," as long as you get good performance out of it.


Application Level Benchmarks, or: Why "your mileage may vary"

Have you ever been tempted to buy one of those new "effective access time: 0.5 milliseconds" disk drives? Look before you leap: obtaining such a figure for a current-technology drive means the drive manufacturer is using a less-than-adequate disk benchmarking program. Such a program can easily be fooled by the presence of a small onboard cache; by continuously reading and writing data in the same area of the disk, the benchmark merely updates the RAM on the disk controller, while very little disk activity actually takes place. Updating a few kilobytes of RAM in half a millisecond isn't that impressive, and this becomes evident as soon as you start running real-life applications. If you are using a software disk cache, as is now standard on most operating systems, you will hardly notice the presence of an additional 256KB controller-based cache, unless you have a nonstandard configuration or device.

Actual storage system performance depends on an assortment of parameters: bus bandwidth, cache strategy, scatter-gather algorithm, cluster size and fragmentation, to name a few. None of these parameters is influenced by the physical characteristics of the specific hard drive used, and each of them has more impact on overall storage system performance than the proverbial "average seek time." The more advanced disk benchmarking programs try to eliminate the effects of operating-system-level parameters by communicating directly with the disk.
Unfortunately, results obtained in this way are about as useful as knowing the top RPM of a car engine running in neutral.

Two cases present even vaguer prospects: networks and disk arrays. In both cases the same hardware can easily be configured to work in different modes, producing a vast range of performance. With networks, the network topology may be redefined and the server configuration altered, resulting in an overall system throughput that has little to do with the actual performance of the drives themselves. The second special case is disk arrays configured as a RAID cluster: the performance of the array as a unit can be as high as twice (or more) that of a single drive, or as low as a fraction of it.

The experienced user will try to estimate the importance of each performance feature of the storage system, weigh them against each other, and make a decision based more on intuition than on hard facts. Eventually, the user will run the ultimate benchmark: a test drive of the system with their favorite application. As unscientific as it may seem, the results of this test are more meaningful than any of the standard benchmarks. The application-level benchmarking approach is defined as testing a system in a manner that resembles actual usage as closely as possible. This seems to be the most consistent and useful approach to the problem of measuring storage system performance.


Shortcomings of Existing Benchmarks, or: Why results are only as good as the measurement tool

* Cache

Since most existing disk benchmarking programs were designed to give results within a few seconds of activation, they usually perform their tests by moving very small amounts of data. With modern high-end storage devices you could have anywhere from a few kilobytes to 64 megabytes or more of cache between the CPU and the storage system. If a benchmarking program uses standard file accesses, most or all of the data won't even reach the drive by the time the measurement period is over, yielding ludicrous results, such as a transfer rate of 20 megabytes per second. If the cache is embedded in the hardware, as is typical of high-end SCSI adapters, it presents the same problem for benchmarking software that uses direct access to the drive.

In order to measure the true effect of a cache, the benchmark should move vast amounts of data during the test, use huge files for disk access, and generally behave like a typical application. A good benchmark shouldn't eliminate the cache effect, but rather measure the true benefit an actual application will gain from the presence of that cache.

* Compression

Disk compression hardware and utilities are new players in the small-computer arena. Before data is written to the drive it is compressed, either by a software device driver or by a dedicated hardware adapter; during reads, the data is uncompressed back to its original form. Best performance is achieved when the data is highly regular, as is common in large spreadsheets or databases. However, as the ratio between CPU speed and disk throughput grows, new applications will have built-in data compression, especially those dealing with highly compressible data such as video and music. When this happens, the benefits of disk-based compression will diminish.

A typical disk benchmarking program uses either fixed data blocks or whatever garbage data happens to be in memory at the time.
In either case, the data tends to be highly compressible, resulting in exceptionally high test scores for transfer rate. To obtain true results, the benchmark should use data patterns which resemble the actual data as closely as possible. Since this is virtually impossible, the next best thing is to select a reference pattern which can be easily defined and whose maximum compression ratio is known in advance. The only data pattern that meets both of these criteria is an incompressible pattern. Information theory assures us that a suitable pseudorandom sequence has zero redundancy and thus cannot be compressed. As far as the compression logic is concerned, a pseudorandom sequence looks identical to a pre-compressed sequence.

* Multi-threading

A drive with a long seek time may perform very well for sequential accesses, but it will cease doing so as soon as several threads are running concurrently, even if all threads access the same file, and all do so sequentially. The necessity to switch tasks results in multiple seeks, up to a point where, with a sufficiently large number of threads, the behavior of the drive resembles totally random access. The extent to which this effect degrades performance cannot be computed a priori from knowledge of the average random access time and transfer rate alone, and thus has to be tested in its own right. A small number (3-6) of concurrent threads is usually enough to feel the full effect of sequential access degradation.

* Sequential vs. Random

Existing benchmarks generally divide disk accesses into two categories: sequential access and random access. However, in real life, very few applications actually behave this way. A modern disk controller with high bandwidth and deferred-write caching can almost totally eliminate seek latency for applications performing only a small portion of random accesses. The same drive, when used with a write-through cache, might perform almost as slowly as under fully random access. A good benchmark should present the system with the correct mixture of sequential and random accesses in order to get valid results.

* Combining reads and writes

Much like combining sequential and random accesses, a mixture of read and write requests may yield results which differ substantially from the 'expected' weighted average of the two. With the proper logic programmed into the drive controller or device driver, and the appropriate scatter/gather method implemented in the operating system, read and write requests can be overlapped from the CPU's point of view, resulting in superior throughput. Some of the more rudimentary disk benchmarking programs do not perform any writes at all during the test, and simply assume the transfer rate is the same as for reads. This assumption falls apart where RAID-class devices are concerned. Depending on RAID level and architecture, the time needed to write a block may differ from the time it takes to read the same block by a factor of two or more. An inherently sluggish RAID device may seem exceptionally fast when tested by only reading existing data, or when the mixture of read and write operations does not match the actual application mixture.
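By way of illustration, the following C fragment sketches the kind of request stream a benchmark should generate: a loop in which the read/write mix and the sequential/random mix are explicit parameters. This is a minimal sketch, not RAIDmark source; the file name, record size, and the two percentages are illustrative assumptions.

    /* Minimal sketch (not RAIDmark code): issue a stream of I/O requests in
     * which the read/write mix and the sequential/random mix are parameters.
     * File name, record size, and the two ratios are illustrative assumptions. */
    #include <stdio.h>
    #include <stdlib.h>

    #define RECORD_SIZE   4096          /* assumed record size in bytes       */
    #define NUM_RECORDS   8192          /* assumed test file size: 32 MB      */
    #define READ_PERCENT  70            /* e.g. 70% reads, 30% writes         */
    #define SEQ_PERCENT   80            /* e.g. 80% sequential, 20% random    */

    int main(void)
    {
        FILE *f = fopen("testfile.dat", "r+b");   /* pre-created test file */
        static char buf[RECORD_SIZE];             /* zero-initialized      */
        long next_seq = 0;                        /* next sequential record */
        int  i;

        if (f == NULL)
            return 1;

        for (i = 0; i < 10000; i++) {
            long rec;

            /* Pick the target record: mostly sequential, sometimes random. */
            if (rand() % 100 < SEQ_PERCENT)
                rec = next_seq++ % NUM_RECORDS;
            else
                rec = rand() % NUM_RECORDS;

            fseek(f, rec * (long)RECORD_SIZE, SEEK_SET);

            /* Pick the operation: mostly reads, sometimes writes. */
            if (rand() % 100 < READ_PERCENT)
                fread(buf, 1, RECORD_SIZE, f);
            else
                fwrite(buf, 1, RECORD_SIZE, f);
        }
        fclose(f);
        return 0;
    }

Varying the two percentages (and the number of concurrent copies of such a loop) is what distinguishes the workload scenarios described in the next section.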
Defining a Reference Disk Benchmark, or: What's a RAIDmark?

The DynaTek RAIDmark utility is intended to measure the integrity, performance, and cost-efficiency of any read/write random-access (blocked I/O) storage subsystem, in terms that are meaningful to the layman and that reflect the apparent usefulness the user gets from the system. RAIDmark is also intended to measure some low-level metrics of the storage device, which can be used for testing and fine-tuning such devices throughout the phases of on-site installation.

The RAIDmark utility defines a set of six standard tests, each simulating a typical I/O-bound computing environment. For each test, the tested system receives a grade proportional to its throughput. A unit of 'one RAIDmark' is defined as the measured throughput (for each test) of an IBM PS/2 Model 95 with the standard IBM 150 MB hard disk. Integrity is tested on a go/no-go basis, using 32-bit checksum verification on each record.

* Low-level performance

The utility measures raw indices of storage device performance: average access time, read/write transfer rate, effective cache size, and cache strategy.

* High-level performance

The utility tests the performance of the storage subsystem at the application level, checking the subsystem's ability to respond to the harsh real-world data traffic of typical heavy-load environments. The utility simulates the data flow of several typical work environments and measures system response and throughput in each of the test cases, giving a combined figure of merit for the system's performance in each scenario. The following scenarios are simulated and tested (a sketch of how such profiles might be encoded as parameters appears after the list):

- Typical workstation applications: a multiuser (3-10 users) environment; many small- to medium-size file requests, most of which are read-only accesses, with less frequent large file read/write operations (as in a CAD/CAM file save or memory swap).

- Transaction processing environment: 5-50 users; many small record read and update requests; reports and backup processes running in the background; the transaction logbook kept as a large sequential file.

- Video/audio/multimedia applications: a single user or few users; huge sequential-access files, with both read and write requests.

- Database query systems: 5-200 users; many small record accesses, almost all of which are read-only; updates, reports, and backup are executed during system off-line time.

- Data logging: up to several hundred sites; many very small random write accesses (typically 20-100 bytes); reports and backup are executed during system off-line time.

- Archiving: few users, with large sequential read/write accesses.
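The sketch below shows one possible way of encoding such workload profiles as parameter records, in the spirit of the main test executive parameters described later in this paper (record size, distribution of access type, number of users or threads). The struct layout and the concrete numbers are illustrative assumptions, except where the text itself gives figures (user counts, and the 20-100 byte records for data logging).

    /* Hypothetical encoding of the workload scenarios as parameter records.
     * Field choices follow the test-executive parameters described in this
     * paper; the numbers below are illustrative, except where quoted from
     * the text. */
    struct workload_profile {
        const char *name;
        int  threads;            /* simulated users / concurrent threads      */
        long record_bytes;       /* typical request size                      */
        int  read_percent;       /* share of read requests                    */
        int  sequential_percent; /* share of sequential (vs. random) requests */
    };

    static const struct workload_profile profiles[] = {
        /* "Data logging": hundreds of sites, very small random writes of
         * 20-100 bytes (figures from the text); mix percentages assumed.   */
        { "Data logging",          200,    64,   5,  10 },

        /* "Video/audio/multimedia": single or few users, huge sequential
         * files, both reads and writes; all numbers here assumed.          */
        { "Multimedia",              2, 65536,  60,  95 },
    };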
* Degraded subsystem performance

When the subject of the tests is a RAID-class storage system, the user can disable one of the subsystem elements and then proceed to measure the degraded performance of this hindered setup.

* Design constraints and limitations

- Free disk space

In order for the RAIDmark utility to conduct its tests, it should be able to create files of several megabytes each, or even larger if caching effects are to be measured accurately. For best accuracy, at least 150MB of free disk space should be available on the tested drive.

- Operating system interface

In a multi-processing environment, all benchmarks should be run as the sole process; otherwise time measurements will be of little value.

- Integrity and stability

The ability of RAID-class devices to operate with parts of the subsystem inoperable (or missing altogether) might create a situation in which the user is not aware that the system is operating under highly unfavorable conditions which adversely affect the benchmark. Moreover, modern disk controllers can overcome some physical mishaps, such as bad sectors or unaligned heads, relying on firmware logic to correct these problems at the expense of degraded performance. The user should identify and isolate such conditions, possibly by testing for stability over time and over different areas of the disk platters.

- DOS FAT latency

When using top-level MS-DOS file I/O requests, each random file seek requires the operating system to scan the file allocation table (FAT) from the entry pointing to the beginning of the file to the entry pointing to the requested cluster. Although the FAT is usually kept in RAM, for large files this sequential scanning process may actually take longer than the physical movement of the disk arm. Some advanced applications (database managers, etc.) overcome this delay by keeping several logical entry points to the file ("file handles") scattered at different points in the file; access is then achieved using the nearest file handle, reducing FAT latency by a factor of 10 to 20. Other applications, such as the MS-Windows 3.0/3.1 Virtual Memory Manager, implement their own file system to achieve zero FAT latency. More modern operating systems, such as OS/2 with HPFS, achieve the same favorable results using advanced built-in random seek methods.

For the sake of being as general as possible, RAIDmark measures the exact FAT latency on the tested system by comparing the average access times for clusters located in a limited region near the beginning of the file to those of clusters located in a similar-size region near the end of the file. The average FAT latency is then subtracted from all random access test times, giving the theoretical limit for an 'optimally tuned' application.

* Test files

Test files are homogeneous streams of random values, logically accessed as an array of 32-bit unsigned integers. The stream is logically subdivided into records of 16 double-words (long-words), each record containing an embedded checksum. Each record in a test file thus contains 16 double-words of 32 bits each: fifteen of these double-words are generated by a pseudorandom number generator, and the 16th value is calculated in such a way that the sum of all words in the record, plus the offset of the record from the beginning of the file (in bytes), equals zero (modulo 2^32). This file structure enables the user to create files that are incompressible by nature, but still allows the user to verify the validity of the data with an error-detection probability approaching certainty. The method requires one multiplication and two addition operations per 32-bit integer during file creation, but only one addition per double-word during read access.

To prevent meaningless results from slow-processor machines with concurrent disk access, a 32KB block of data (512 records of 64 bytes) is prepared in advance, and records from this block are written to the disk at file creation, with only one word in each 16-word record changed for each write operation to account for the different file offsets. This gives even a slow CPU ample time to complete data preparation before the next write operation is due.
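A sketch of the read-side integrity check implied by this record format is shown below: summing the sixteen double-words of a record together with its byte offset must yield zero modulo 2^32. The function name and types are illustrative; RAIDmark's actual source is not reproduced in this paper.

    /* Sketch of the integrity check implied by the record format above:
     * the 16 double-words of a record, plus the record's byte offset in
     * the file, must sum to zero modulo 2^32.  Names are illustrative. */
    #include <stdint.h>

    #define WORDS_PER_RECORD 16

    /* Returns 1 if the 64-byte record at byte offset 'offset' is valid. */
    int record_is_valid(const uint32_t record[WORDS_PER_RECORD], uint32_t offset)
    {
        uint32_t sum = offset;          /* unsigned arithmetic wraps mod 2^32 */
        int i;

        for (i = 0; i < WORDS_PER_RECORD; i++)
            sum += record[i];           /* one addition per double-word */

        return sum == 0;
    }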
Theoretically, data created by the method described above can be compressed; however, there is currently no compression algorithm capable of detecting the 256Kbit-long patterns it creates. A question arises as to how well this random-data file serves as a test case, since in reality most data files are compressible by 30%-60% or, in some homogeneous data acquisition applications, by up to 80%. Using compressible data, however, exposes us to the vast range of compression algorithms, each with its own peculiarities regarding specific data patterns, sometimes resulting in a twofold difference in throughput for some specific instance. Hence, the approach taken in RAIDmark is the worst-case compression scenario, in which all algorithms behave in a similar fashion, namely, no compression occurs. In a future version the user may be given the option of generating semi-random ("pink noise") data files, permitting the compression mechanism to affect results.

- Test file creation

If any of the disk tests is requested, a 128MB file is created. This process is timed as accurately as possible, covering both the initial file open operation and the writing of data to the disk. Measurement is done separately for chunks of data of varying sizes: the first two chunks are 16KB long, the third is 32KB, the fourth 64KB, and so on, with the 8th chunk being 1MB long. The remaining chunks are 1MB each and are written until the file reaches its final size.

The 128MB file size was chosen arbitrarily, being larger than today's biggest disk cache units but less than half the size of modern high-capacity disks. This is a default value, intended to become the 'RAIDmark standard,' but values ranging from 1MB to 1GB are implemented and can be selected with a command line option. Test files smaller than 16MB might produce inaccurate results, and the user is warned if such a selection is made. Using the default file size, each separate test takes between a few seconds and two minutes to complete, permitting accurate measurement without seizing the system for an intolerable duration. There are 14 separate tests in a full test suite, so with all the testing overhead, running the full benchmark may take the better part of an hour.

- Determine deferred-write cache size

During the file creation phase, effective transfer-rate throughput is measured for increasing data chunks as described above. If the system contains a deferred-write cache, the first few chunks will take a very short time to be written, usually exhibiting a throughput in excess of 10 megabytes per second. At some point, throughput will decline sharply; this point in the data flow marks the end of the posted-write cache. With systems employing dynamic caching, the value obtained for the cache size may be somewhat arbitrary, but if the test is run on an otherwise idle system, the size detected will be close to the maximum possible for that system. Note that in order for this test to run accurately, the system should be given enough time to "rest" before the test commences, so that previously cached data can be flushed to disk by the cache logic. The user should wait until all previous disk activity ceases before starting this test.
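The fragment below is a minimal sketch of the chunk-doubling measurement just described: successively larger chunks are written and timed, and the point at which throughput falls off sharply is taken as an upper bound on the posted-write cache size. The file name, the doubling schedule capped at 1MB, and the "drops below half of the previous rate" threshold are illustrative assumptions, and clock() merely stands in for whatever wall-clock, high-resolution timer the utility actually uses.

    /* Sketch of deferred-write cache detection: write chunks of doubling
     * size, time each one, and report the point where throughput drops
     * sharply.  File name, chunk schedule, and the "half the previous
     * rate" threshold are illustrative; clock() stands in for a proper
     * wall-clock timer. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        FILE  *f = fopen("cachetst.dat", "wb");
        char  *buf = calloc(1, 1024 * 1024);      /* largest chunk: 1MB   */
        size_t chunk = 16 * 1024;                 /* first chunk: 16KB    */
        long   total = 0;
        double prev_rate = 0.0;

        if (f == NULL || buf == NULL)
            return 1;

        while (total < 128L * 1024 * 1024) {      /* default 128MB file   */
            clock_t t0 = clock();

            fwrite(buf, 1, chunk, f);
            fflush(f);                            /* push past stdio      */
            {
                double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

                if (secs > 0.0) {                 /* skip unmeasurable chunks */
                    double rate = (double)chunk / secs;

                    if (prev_rate > 0.0 && rate < prev_rate / 2.0) {
                        printf("throughput knee near %ld bytes written\n", total);
                        break;                    /* end of posted-write cache */
                    }
                    prev_rate = rate;
                }
            }
            total += (long)chunk;
            if (chunk < 1024 * 1024)
                chunk *= 2;                       /* 16K, 32K, 64K, ... 1MB */
        }
        fclose(f);
        free(buf);
        return 0;
    }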
- Determine raw transfer rate for write

After obtaining an upper bound for the posted-write cache size, the actual non-cached transfer rate for write operations is measured, by first writing data to a different location on the drive. The amount of data written should exceed the size of the computed deferred-write cache. The benchmark then compares total write times for data chunks of different sizes, specifically 8KB and 32KB blocks. By subtracting the two, all fixed-overhead delays cancel out, and the net result is the time it takes to write 24KB of data.

This figure for transfer rate will usually be lower than the nominal figure quoted by the drive manufacturer, since the measurement includes drive head movements (seek and search operations), operating system and BIOS overhead, and cache-miss delays. However, the figure obtained this way is a more meaningful measure of subsystem performance, since this is the actual throughput as experienced by the application and the end user. Delays created by the benchmarking utility itself are taken into account and removed from the measurement.

During the "post-cache" phase of writing, each of the 1MB chunks of data should take approximately the same time to complete. If the variability is higher than expected, the user is warned of the system's instability. This situation may arise when other processes in the system or on the network are active, when the cache strategy involves some exotic algorithm, when the disk is highly fragmented, when many bad sectors are scattered over the disk's test area, or when a "flaky" drive causes retried attempts for some of the operations. The user then has the choice of repeating the test, or of referring to the user guide for troubleshooting advice.

- Determine raw transfer rate for read

Transfer rate for sequential read access is measured in the same way, by comparing total read times for data chunks of different sizes, specifically 8KB and 32KB blocks. By subtracting the two, all fixed-overhead delays cancel out, and the net result is the time it takes to read 24KB of data.

- Main test executive

The main test executive procedure is a generic routine whose input parameters define the real-life environment each test case is supposed to simulate. Parameters include:

- record size
- distribution of access type
- number of users or processing threads

The procedure determines the number of times the tests are to be repeated as a function of the duration of a single test. The rule is that, by default, each test case should take from a few seconds to around two minutes. If the total desired testing time of the complete benchmark is given on the command line, the duration of each single test is adjusted accordingly. This time does not include benchmark overhead, such as file creation and cache overflowing, hence the total benchmark duration may be considerably longer than the command line parameter.

- Build random block

This routine creates a 32KB block in memory for writing to the disk during the file creation phase. The format of the block is described above. The algorithm used for generating random numbers is the linear congruential method (described in D.E. Knuth, "The Art of Computer Programming," Vol. 2). The checksum word of each record is placed at the start (first four bytes) of the record. To account for different file offsets, this word is updated for each instance of writing the 32KB block.
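As a sketch of this block-building step, the fragment below fills a 32KB block with linear-congruential pseudorandom double-words and plants a checksum in the first word of each 64-byte record so that the record, together with its byte offset in the file, sums to zero modulo 2^32, as specified above. The particular LCG constants and the function names are assumptions; the paper does not give RAIDmark's actual constants.

    /* Sketch of the "build random block" step: fill a 32KB block (512
     * records of 16 double-words) with linear-congruential pseudorandom
     * values, and place in the first word of each record a checksum chosen
     * so that the record plus its byte offset sums to zero modulo 2^32.
     * The LCG constants are common textbook values, not necessarily
     * RAIDmark's. */
    #include <stdint.h>

    #define WORDS_PER_RECORD  16
    #define RECORDS_PER_BLOCK 512             /* 512 * 64 bytes = 32KB */

    static uint32_t seed = 12345;             /* illustrative seed     */

    static uint32_t lcg_next(void)
    {
        /* x(n+1) = (a * x(n) + c) mod 2^32, per Knuth's linear
         * congruential method; a and c are illustrative constants. */
        seed = seed * 1664525u + 1013904223u;
        return seed;
    }

    /* Fill 'block' for writing at byte offset 'file_offset' in the file. */
    void build_random_block(uint32_t block[RECORDS_PER_BLOCK][WORDS_PER_RECORD],
                            uint32_t file_offset)
    {
        int r, w;

        for (r = 0; r < RECORDS_PER_BLOCK; r++) {
            uint32_t offset = file_offset + (uint32_t)r * 64u;
            uint32_t sum = offset;

            for (w = 1; w < WORDS_PER_RECORD; w++) {  /* 15 random words */
                block[r][w] = lcg_next();
                sum += block[r][w];
            }
            block[r][0] = (uint32_t)(0u - sum);       /* checksum word so
                                                         record + offset == 0 */
        }
    }

Note that RAIDmark, as described in the text, generates the random words only once and then updates just the checksum word for each instance of writing the block; the sketch regenerates everything for brevity.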
Conclusion

In order to help the user select a storage system and configuration that matches their computing needs, a standard reference benchmark for measuring storage system performance has been defined. Existing disk benchmarking programs fall short when measuring modern 'smart' storage systems, or when unusual disk configurations are used: many of the most important performance factors are not taken into account, and the characteristics of the user's application play no role in the measurement. The DynaTek RAIDmark has been created to overcome these limitations by operating as an application-level benchmark. RAIDmark measures storage system performance under a variety of typical environment conditions, and provides the user with a good indication of the expected usefulness of their system.

* Technical reference

- Patterson, David A.; Gibson, Garth; Katz, Randy H.: "A Case for Redundant Arrays of Inexpensive Disks (RAID)"; ACM SIGMOD, Chicago, 1988.
- Patterson, David A.; Chen, Peter; Gibson, Garth; Katz, Randy H.: "An Introduction to Redundant Arrays of Inexpensive Disks (RAID)"; IEEE, 1989.
- Graham, Richard A.: "Redundant Arrays of Inexpensive Disks"; Integra Technologies, Inc., 1991.
- Finney, Kenneth C.; Graham, Richard A.: "A Discussion of RAID Technology"; DynaTek/Integra, 1992.

DynaTek and RAIDmark are trademarks of DynaTek Automation Systems Inc. HPFS, IBM, Integra, MS-DOS, MS-Windows, OS/2, and PS/2 are registered trademarks of their respective holders.