Benchmarking Storage Systems

White Paper
Ohad I. Unna
DynaTek Automation Systems Inc.
September 1992


Abstract

Existing disk benchmarking programs fall short in measuring the performance of the storage subsystem as a whole, missing the forest for the trees. In order to obtain a meaningful performance index of a storage subsystem, one that is congruent with the user-perceived usefulness of that system, a benchmark has to measure performance at the application level, simulating real-life computing environments. This paper presents a method for constructing such a benchmark.


Foreword, or: Who needs another disk benchmarking program?

It is said that if you have a watch you know what time it is, but if you have two, you are never sure. If you have half a dozen disk benchmarking utilities, you probably have six different sets of results, and there is a good chance none of them bears any resemblance to the way your system performs under its normal workload.

Back in the old days (1982-83) things were pretty simple: PC hard disk drives invariably spun at 3600 revolutions per minute, almost always had 17 sectors per track, and built-in bad-sector remapping and multi-megabyte caches were practically unheard of in the microcomputer world. When the IDE and SCSI disk interfaces were introduced to the PC arena, things got really complicated: disk drive vendors started employing sector-mapping schemes to avoid the 1024-cylinder restriction, controllers were enhanced with large onboard caches, and on-the-fly disk compression adapters and software became more and more popular, rendering most of the old disk benchmarking software useless.

If you are not familiar with terms such as "sector-mapping" and "compression adapters," don't worry. As a user, you shouldn't really care whether your drive has "skewed-track interleave" or "low BIOS overhead latency," as long as you get good performance out of it.


Application Level Benchmarks, or: Why "your mileage may vary"

Have you ever been tempted to buy one of those new "effective access time: 0.5 milliseconds" disk drives? Look before you leap: obtaining such a figure for a current-technology drive means the drive manufacturer is using a less-than-adequate disk benchmarking program. Such a program can easily be fooled by the presence of a small onboard cache; by continuously reading and writing data in the same area of the disk, the benchmark merely updates the RAM on the disk controller, while very little disk activity actually takes place. Updating a few kilobytes of RAM in half a millisecond isn't that impressive, and this becomes evident as soon as you start running real-life applications. If you are using a software disk cache, as is now standard on most operating systems, you will hardly notice the presence of an additional 256KB controller-based cache, unless you have a nonstandard configuration or device.

Actual storage system performance depends on an assortment of parameters: bus bandwidth, cache strategy, scatter-gather algorithm, cluster size and fragmentation, to name a few. None of these parameters is influenced by the physical characteristics of the specific hard drive used, and each of them has more impact on overall storage system performance than the proverbial "average seek time." The more advanced disk benchmarking programs try to eliminate the effects of operating-system-level parameters by communicating directly with the disk.
Unfortunately, results obtained in this way are about as useful as knowing the top RPM of a car engine running in neutral.

Two cases present even vaguer prospects: networks and disk arrays. In both cases the same hardware can easily be configured to work in different modes, producing a vast range of performance. With networks, the network topology may be redefined and the server configuration altered, resulting in an overall system throughput that has little to do with the actual performance of the drives themselves. The second special case is disk arrays configured as a RAID cluster: the performance of the array as a unit can be as high as twice (or more) that of a single drive, or as low as a fraction of it.

The experienced user will try to estimate the importance of each performance feature of the storage system, weigh them against each other, and make a decision based more on intuition than on hard facts. Eventually, the user will run the ultimate benchmark: a test drive of the system with their favorite application. As unscientific as it may seem, the results of this test are more meaningful than any of the standard benchmarks. The application-level benchmarking approach is defined as testing a system in a manner that resembles actual usage as closely as possible. This seems to be the most consistent and useful approach to the problem of measuring storage system performance.


Shortcomings of Existing Benchmarks, or: Why results are only as good as the measurement tool

* Cache

Since most existing disk benchmarking programs were designed to give results within a few seconds of activation, they usually perform their tests by moving very small amounts of data. With modern high-end storage devices you could have anywhere from a few kilobytes to 64 megabytes or more of cache between the CPU and the storage system. If a benchmarking program uses standard file accesses, most or all of the data won't even reach the drive by the time the measurement period is over, yielding ludicrous results, such as a transfer rate of 20 megabytes per second. If the cache is embedded in the hardware, as is typical of high-end SCSI adapters, it presents the same problem for benchmarking software that uses direct access to the drive.

In order to measure the true effect of a cache, the benchmark should move vast amounts of data during the test, use huge files for disk access, and generally behave like a typical application. A good benchmark shouldn't eliminate the cache effect, but rather measure the true benefit an actual application will gain from the presence of that cache.

* Compression

Disk compression hardware and utilities are new players in the small-computer arena. Before data is written to the drive it is compressed, either by a software device driver or by a dedicated hardware adapter; during reads, the data is uncompressed back to its original form. Best performance is achieved when the data is highly regular, as is common in large spreadsheets or databases. However, as the ratio between CPU speed and disk throughput grows, new applications will have built-in data compression, especially those dealing with highly compressible data such as video and music. When this happens, the benefits of disk-based compression will diminish.

A typical disk benchmarking program uses either fixed data blocks or whatever garbage data happens to be in memory at the time.
In either case, the data tends to be highly compressible, resulting in exceptionally high test scores for transfer rate. To obtain true results, the benchmark should use data patterns which resemble the actual data as closely as possible. Since this is virtually impossible, the next best thing is to select a reference pattern which can be easily defined and whose maximum compression ratio is known in advance. The only data pattern that meets both of these criteria is an incompressible pattern. Information theory assures us that a suitable pseudorandom sequence has zero redundancy and thus cannot be compressed. As far as the compression logic is concerned, a pseudorandom sequence looks identical to a pre-compressed sequence.

* Multi-threading

A drive with a long seek time may perform very well for sequential accesses, but it will cease doing so as soon as several threads are running concurrently, even if all threads access the same file, and all do so sequentially. The necessity to switch tasks results in multiple seeks, up to a point where, with a sufficiently large number of threads, the behavior of the drive resembles totally random access. The extent to which this effect degrades performance cannot be computed a priori from knowledge of the average random access time and transfer rate alone, and thus has to be tested in its own right. A small number (3-6) of concurrent threads is usually enough to feel the full effect of sequential access degradation.

* Sequential vs. Random

Existing benchmarks generally divide disk accesses into two categories: sequential access and random access. However, in real life, very few applications actually behave this way. A modern disk controller with high bandwidth and deferred-write caching can almost totally eliminate seek latency for applications performing only a small portion of random accesses. The same drive, when used with a write-through cache, might perform almost as slowly as under fully random access. A good benchmark should present the system with the correct mixture of sequential and random accesses in order to get valid results.

* Combining reads and writes

Much like combining sequential and random accesses, a mixture of read and write requests may yield results which differ substantially from the 'expected' weighted average of the two. With the proper logic programmed into the drive controller or device driver, and the appropriate scatter/gather method implemented in the operating system, read and write requests can be overlapped from the CPU's point of view, resulting in superior throughput. Some of the more rudimentary disk benchmarking programs do not perform any writes at all during the test, and simply assume the transfer rate is the same as for reads. This assumption falls apart where RAID-class devices are concerned. Depending on RAID level and architecture, the time needed to write a block may differ from the time it takes to read the same block by a factor of two or more. An inherently sluggish RAID device may seem exceptionally fast when tested by only reading existing data, or when the mixture of read and write operations does not match the actual application mixture.
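By way of illustration, the following C fragment sketches the kind of request stream a benchmark should generate: a loop in which the read/write mix and the sequential/random mix are explicit parameters. This is a minimal sketch, not RAIDmark source; the file name, record size, and the two percentages are illustrative assumptions.

    /* Minimal sketch (not RAIDmark code): issue a stream of I/O requests in
     * which the read/write mix and the sequential/random mix are parameters.
     * File name, record size, and the two ratios are illustrative assumptions. */
    #include <stdio.h>
    #include <stdlib.h>

    #define RECORD_SIZE   4096          /* assumed record size in bytes       */
    #define NUM_RECORDS   8192          /* assumed test file size: 32 MB      */
    #define READ_PERCENT  70            /* e.g. 70% reads, 30% writes         */
    #define SEQ_PERCENT   80            /* e.g. 80% sequential, 20% random    */

    int main(void)
    {
        FILE *f = fopen("testfile.dat", "r+b");   /* pre-created test file */
        static char buf[RECORD_SIZE];             /* zero-initialized      */
        long next_seq = 0;                        /* next sequential record */
        int  i;

        if (f == NULL)
            return 1;

        for (i = 0; i < 10000; i++) {
            long rec;

            /* Pick the target record: mostly sequential, sometimes random. */
            if (rand() % 100 < SEQ_PERCENT)
                rec = next_seq++ % NUM_RECORDS;
            else
                rec = rand() % NUM_RECORDS;

            fseek(f, rec * (long)RECORD_SIZE, SEEK_SET);

            /* Pick the operation: mostly reads, sometimes writes. */
            if (rand() % 100 < READ_PERCENT)
                fread(buf, 1, RECORD_SIZE, f);
            else
                fwrite(buf, 1, RECORD_SIZE, f);
        }
        fclose(f);
        return 0;
    }

Varying the two percentages (and the number of concurrent copies of such a loop) is what distinguishes the workload scenarios described in the next section.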
Defining a Reference Disk Benchmark, or: What's a RAIDmark?

The DynaTek RAIDmark utility is intended to measure the integrity, performance, and cost-efficiency of any read/write random-access (blocked I/O) storage subsystem, in terms that are meaningful to the layman and that reflect the apparent usefulness the user gets from the system. RAIDmark is also intended to measure some low-level metrics of the storage device, which can be used for testing and fine-tuning such devices throughout the phases of on-site installation.

The RAIDmark utility defines a set of six standard tests, each simulating a typical I/O-bound computing environment. For each test, the tested system receives a grade proportional to its throughput. A unit of 'one RAIDmark' is defined as the measured throughput (for each test) of an IBM PS/2 Model 95 with the standard IBM 150 MB hard disk. Integrity is tested on a go/no-go basis, using 32-bit checksum verification on each record.

* Low-level performance

The utility measures raw indices of storage device performance: average access time, read/write transfer rate, effective cache size, and cache strategy.

* High-level performance

The utility tests the performance of the storage subsystem at the application level, checking the subsystem's ability to respond to the harsh real-world data traffic of typical heavy-load environments. The utility simulates the data flow of several typical work environments and measures system response and throughput in each of the test cases, giving a combined figure of merit for the system's performance in each scenario. The following scenarios are simulated and tested (a sketch of how such profiles might be encoded as parameters appears after the list):

- Typical workstation applications: a multiuser (3-10 users) environment; many small- to medium-size file requests, most of which are read-only accesses, with less frequent large file read/write operations (as in a CAD/CAM file save or memory swap).

- Transaction processing environment: 5-50 users; many small record read and update requests; reports and backup processes running in the background; the transaction logbook kept as a large sequential file.

- Video/audio/multimedia applications: a single user or few users; huge sequential-access files, with both read and write requests.

- Database query systems: 5-200 users; many small record accesses, almost all of which are read-only; updates, reports, and backup are executed during system off-line time.

- Data logging: up to several hundred sites; many very small random write accesses (typically 20-100 bytes); reports and backup are executed during system off-line time.

- Archiving: few users, with large sequential read/write accesses.
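The sketch below shows one possible way of encoding such workload profiles as parameter records, in the spirit of the main test executive parameters described later in this paper (record size, distribution of access type, number of users or threads). The struct layout and the concrete numbers are illustrative assumptions, except where the text itself gives figures (user counts, and the 20-100 byte records for data logging).

    /* Hypothetical encoding of the workload scenarios as parameter records.
     * Field choices follow the test-executive parameters described in this
     * paper; the numbers below are illustrative, except where quoted from
     * the text. */
    struct workload_profile {
        const char *name;
        int  threads;            /* simulated users / concurrent threads      */
        long record_bytes;       /* typical request size                      */
        int  read_percent;       /* share of read requests                    */
        int  sequential_percent; /* share of sequential (vs. random) requests */
    };

    static const struct workload_profile profiles[] = {
        /* "Data logging": hundreds of sites, very small random writes of
         * 20-100 bytes (figures from the text); mix percentages assumed.   */
        { "Data logging",          200,    64,   5,  10 },

        /* "Video/audio/multimedia": single or few users, huge sequential
         * files, both reads and writes; all numbers here assumed.          */
        { "Multimedia",              2, 65536,  60,  95 },
    };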
* Degraded subsystem performance

When the subject of the tests is a RAID-class storage system, the user can disable one of the subsystem elements and then proceed to measure the degraded performance of this hindered setup.

* Design constraints and limitations

- Free disk space

In order for the RAIDmark utility to conduct its tests, it should be able to create files of several megabytes each, or even larger if caching effects are to be measured accurately. For best accuracy, at least 150MB of free disk space should be available on the tested drive.

- Operating system interface

In a multi-processing environment, all benchmarks should be run as the sole process; otherwise time measurements will be of little value.

- Integrity and stability

The ability of RAID-class devices to operate with parts of the subsystem inoperable (or missing altogether) might create a situation in which the user is not aware that the system is operating under highly unfavorable conditions which adversely affect the benchmark. Moreover, modern disk controllers can overcome some physical mishaps, such as bad sectors or unaligned heads, relying on firmware logic to correct these problems at the expense of degraded performance. The user should identify and isolate such conditions, possibly by testing for stability over time and over different areas of the disk platters.

- DOS FAT latency

When using top-level MS-DOS file I/O requests, each random file seek requires the operating system to scan the file allocation table (FAT) from the entry pointing to the beginning of the file to the entry pointing to the requested cluster. Although the FAT is usually kept in RAM, for large files this sequential scanning process may actually take longer than the physical movement of the disk arm. Some advanced applications (database managers, etc.) overcome this delay by keeping several logical entry points to the file ("file handles") scattered at different points in the file; access is then achieved using the nearest file handle, reducing FAT latency by a factor of 10 to 20. Other applications, such as the MS-Windows 3.0/3.1 Virtual Memory Manager, implement their own file system to achieve zero FAT latency. More modern operating systems, such as OS/2 with HPFS, achieve the same favorable results using advanced built-in random seek methods.

For the sake of being as general as possible, RAIDmark measures the exact FAT latency on the tested system by comparing the average access times for clusters located in a limited region near the beginning of the file to those of clusters located in a similar-size region near the end of the file. The average FAT latency is then subtracted from all random access test times, giving the theoretical limit for an 'optimally tuned' application.

* Test files

Test files are homogeneous streams of random values, logically accessed as an array of 32-bit unsigned integers. The stream is logically subdivided into records of 16 double-words (long-words), each record containing an embedded checksum. Each record in a test file thus contains 16 double-words of 32 bits each: fifteen of these double-words are generated by a pseudorandom number generator, and the 16th value is calculated in such a way that the sum of all words in the record, plus the offset of the record from the beginning of the file (in bytes), equals zero (modulo 2^32). This file structure enables the user to create files that are incompressible by nature, but still allows the user to verify the validity of the data with an error-detection probability approaching certainty. The method requires one multiplication and two addition operations per 32-bit integer during file creation, but only one addition per double-word during read access.

To prevent meaningless results from slow-processor machines with concurrent disk access, a 32KB block of data (512 records of 64 bytes) is prepared in advance, and records from this block are written to the disk at file creation, with only one word in each 16-word record changed for each write operation to account for the different file offsets. This gives even a slow CPU ample time to complete data preparation before the next write operation is due.
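A sketch of the read-side integrity check implied by this record format is shown below: summing the sixteen double-words of a record together with its byte offset must yield zero modulo 2^32. The function name and types are illustrative; RAIDmark's actual source is not reproduced in this paper.

    /* Sketch of the integrity check implied by the record format above:
     * the 16 double-words of a record, plus the record's byte offset in
     * the file, must sum to zero modulo 2^32.  Names are illustrative. */
    #include <stdint.h>

    #define WORDS_PER_RECORD 16

    /* Returns 1 if the 64-byte record at byte offset 'offset' is valid. */
    int record_is_valid(const uint32_t record[WORDS_PER_RECORD], uint32_t offset)
    {
        uint32_t sum = offset;          /* unsigned arithmetic wraps mod 2^32 */
        int i;

        for (i = 0; i < WORDS_PER_RECORD; i++)
            sum += record[i];           /* one addition per double-word */

        return sum == 0;
    }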
Theoretically, data created by the method described above can be compressed; however, there is currently no compression algorithm capable of detecting the 256Kbit-long patterns it creates. A question arises as to how well this random-data file serves as a test case, since in reality most data files are compressible by 30%-60% or, in some homogeneous data acquisition applications, by up to 80%. Using compressible data, however, exposes us to the vast range of compression algorithms, each with its own peculiarities regarding specific data patterns, sometimes resulting in a twofold difference in throughput for some specific instance. Hence, the approach taken in RAIDmark is the worst-case compression scenario, in which all algorithms behave in a similar fashion, namely, no compression occurs. In a future version the user may be given the option of generating semi-random ("pink noise") data files, permitting the compression mechanism to affect results.

- Test file creation

If any of the disk tests is requested, a 128MB file is created. This process is timed as accurately as possible, covering both the initial file open operation and the writing of data to the disk. Measurement is done separately for chunks of data of varying sizes: the first two chunks are 16KB long, the third is 32KB, the fourth 64KB, and so on, with the 8th chunk being 1MB long. The remaining chunks are 1MB each and are written until the file reaches its final size.

The 128MB file size was chosen arbitrarily, being larger than today's biggest disk cache units but less than half the size of modern high-capacity disks. This is a default value, intended to become the 'RAIDmark standard,' but values ranging from 1MB to 1GB are implemented and can be selected with a command line option. Test files smaller than 16MB might produce inaccurate results, and the user is warned if such a selection is made. Using the default file size, each separate test takes between a few seconds and two minutes to complete, permitting accurate measurement without seizing the system for an intolerable duration. There are 14 separate tests in a full test suite, so with all the testing overhead, running the full benchmark may take the better part of an hour.

- Determine deferred-write cache size

During the file creation phase, effective transfer-rate throughput is measured for increasing data chunks as described above. If the system contains a deferred-write cache, the first few chunks will take a very short time to be written, usually exhibiting a throughput in excess of 10 megabytes per second. At some point, throughput will decline sharply; this point in the data flow marks the end of the posted-write cache. With systems employing dynamic caching, the value obtained for the cache size may be somewhat arbitrary, but if the test is run on an otherwise idle system, the size detected will be close to the maximum possible for that system. Note that in order for this test to run accurately, the system should be given enough time to "rest" before the test commences, so that previously cached data can be flushed to disk by the cache logic. The user should wait until all previous disk activity ceases before starting this test.
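The fragment below is a minimal sketch of the chunk-doubling measurement just described: successively larger chunks are written and timed, and the point at which throughput falls off sharply is taken as an upper bound on the posted-write cache size. The file name, the doubling schedule capped at 1MB, and the "drops below half of the previous rate" threshold are illustrative assumptions, and clock() merely stands in for whatever wall-clock, high-resolution timer the utility actually uses.

    /* Sketch of deferred-write cache detection: write chunks of doubling
     * size, time each one, and report the point where throughput drops
     * sharply.  File name, chunk schedule, and the "half the previous
     * rate" threshold are illustrative; clock() stands in for a proper
     * wall-clock timer. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        FILE  *f = fopen("cachetst.dat", "wb");
        char  *buf = calloc(1, 1024 * 1024);      /* largest chunk: 1MB   */
        size_t chunk = 16 * 1024;                 /* first chunk: 16KB    */
        long   total = 0;
        double prev_rate = 0.0;

        if (f == NULL || buf == NULL)
            return 1;

        while (total < 128L * 1024 * 1024) {      /* default 128MB file   */
            clock_t t0 = clock();

            fwrite(buf, 1, chunk, f);
            fflush(f);                            /* push past stdio      */
            {
                double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

                if (secs > 0.0) {                 /* skip unmeasurable chunks */
                    double rate = (double)chunk / secs;

                    if (prev_rate > 0.0 && rate < prev_rate / 2.0) {
                        printf("throughput knee near %ld bytes written\n", total);
                        break;                    /* end of posted-write cache */
                    }
                    prev_rate = rate;
                }
            }
            total += (long)chunk;
            if (chunk < 1024 * 1024)
                chunk *= 2;                       /* 16K, 32K, 64K, ... 1MB */
        }
        fclose(f);
        free(buf);
        return 0;
    }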
- Determine raw transfer rate for write

After obtaining an upper bound for the posted-write cache size, the actual non-cached transfer rate for write operations is measured, by first writing data to a different location on the drive. The amount of data written should exceed the size of the computed deferred-write cache. The benchmark then compares total write times for data chunks of different sizes, specifically 8KB and 32KB blocks. By subtracting the two, all fixed-overhead delays cancel out, and the net result is the time it takes to write 24KB of data.

This figure for transfer rate will usually be lower than the nominal figure quoted by the drive manufacturer, since the measurement includes drive head movements (seek and search operations), operating system and BIOS overhead, and cache-miss delays. However, the figure obtained this way is a more meaningful measure of subsystem performance, since this is the actual throughput as experienced by the application and the end user. Delays created by the benchmarking utility itself are taken into account and removed from the measurement.

During the "post-cache" phase of writing, each of the 1MB chunks of data should take approximately the same time to complete. If the variability is higher than expected, the user is warned of the system's instability. This situation may arise when other processes in the system or on the network are active, when the cache strategy involves some exotic algorithm, when the disk is highly fragmented, when many bad sectors are scattered over the disk's test area, or when a "flaky" drive causes retried attempts for some of the operations. The user then has the choice of repeating the test, or of referring to the user guide for troubleshooting advice.

- Determine raw transfer rate for read

Transfer rate for sequential read access is measured in the same way, by comparing total read times for data chunks of different sizes, specifically 8KB and 32KB blocks. By subtracting the two, all fixed-overhead delays cancel out, and the net result is the time it takes to read 24KB of data.

- Main test executive

The main test executive procedure is a generic routine whose input parameters define the real-life environment each test case is supposed to simulate. Parameters include:

- record size
- distribution of access type
- number of users or processing threads

The procedure determines the number of times the tests are to be repeated as a function of the duration of a single test. The rule is that, by default, each test case should take from a few seconds to around two minutes. If the total desired testing time of the complete benchmark is given on the command line, the duration of each single test is adjusted accordingly. This time does not include benchmark overhead, such as file creation and cache overflowing, hence the total benchmark duration may be considerably longer than the command line parameter.

- Build random block

This routine creates a 32KB block in memory for writing to the disk during the file creation phase. The format of the block is described above. The algorithm used for generating random numbers is the linear congruential method (described in D.E. Knuth, "The Art of Computer Programming," Vol. 2). The checksum word of each record is placed at the start (first four bytes) of the record. To account for different file offsets, this word is updated for each instance of writing the 32KB block.
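As a sketch of this block-building step, the fragment below fills a 32KB block with linear-congruential pseudorandom double-words and plants a checksum in the first word of each 64-byte record so that the record, together with its byte offset in the file, sums to zero modulo 2^32, as specified above. The particular LCG constants and the function names are assumptions; the paper does not give RAIDmark's actual constants.

    /* Sketch of the "build random block" step: fill a 32KB block (512
     * records of 16 double-words) with linear-congruential pseudorandom
     * values, and place in the first word of each record a checksum chosen
     * so that the record plus its byte offset sums to zero modulo 2^32.
     * The LCG constants are common textbook values, not necessarily
     * RAIDmark's. */
    #include <stdint.h>

    #define WORDS_PER_RECORD  16
    #define RECORDS_PER_BLOCK 512             /* 512 * 64 bytes = 32KB */

    static uint32_t seed = 12345;             /* illustrative seed     */

    static uint32_t lcg_next(void)
    {
        /* x(n+1) = (a * x(n) + c) mod 2^32, per Knuth's linear
         * congruential method; a and c are illustrative constants. */
        seed = seed * 1664525u + 1013904223u;
        return seed;
    }

    /* Fill 'block' for writing at byte offset 'file_offset' in the file. */
    void build_random_block(uint32_t block[RECORDS_PER_BLOCK][WORDS_PER_RECORD],
                            uint32_t file_offset)
    {
        int r, w;

        for (r = 0; r < RECORDS_PER_BLOCK; r++) {
            uint32_t offset = file_offset + (uint32_t)r * 64u;
            uint32_t sum = offset;

            for (w = 1; w < WORDS_PER_RECORD; w++) {  /* 15 random words */
                block[r][w] = lcg_next();
                sum += block[r][w];
            }
            block[r][0] = (uint32_t)(0u - sum);       /* checksum word so
                                                         record + offset == 0 */
        }
    }

Note that RAIDmark, as described in the text, generates the random words only once and then updates just the checksum word for each instance of writing the block; the sketch regenerates everything for brevity.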
Conclusion

In order to help the user select a storage system and configuration that matches their computing needs, a standard reference benchmark for measuring storage system performance has been defined. Existing disk benchmarking programs fall short when measuring modern 'smart' storage systems, or when unusual disk configurations are used: many of the most important performance factors are not taken into account, and the characteristics of the user's application play no role in the measurement. The DynaTek RAIDmark has been created to overcome these limitations by operating as an application-level benchmark. RAIDmark measures storage system performance under a variety of typical environment conditions, and provides the user with a good indication of the expected usefulness of their system.

* Technical reference

- Patterson, David A.; Gibson, Garth; Katz, Randy H.: "A Case for Redundant Arrays of Inexpensive Disks (RAID)"; ACM SIGMOD, Chicago, 1988.
- Patterson, David A.; Chen, Peter; Gibson, Garth; Katz, Randy H.: "An Introduction to Redundant Arrays of Inexpensive Disks (RAID)"; IEEE, 1989.
- Graham, Richard A.: "Redundant Arrays of Inexpensive Disks"; Integra Technologies, Inc., 1991.
- Finney, Kenneth C.; Graham, Richard A.: "A Discussion of RAID Technology"; DynaTek/Integra, 1992.

DynaTek and RAIDmark are trademarks of DynaTek Automation Systems Inc. HPFS, IBM, Integra, MS-DOS, MS-Windows, OS/2, and PS/2 are registered trademarks of their respective holders.