CACHECHK v4  2/7/96  Copyright (c) 1995 by Ray Van Tassle.

This is NOT freeware. This is postcard-ware. Send me a nice picture
post card as the registration fee. But if you want to, or if this has
been VERY valuable to you, a money contribution will be gratefully
accepted. You decide if you should send money, and how much. How could
I be any more fair??

    Ray Van Tassle
    1020 Fox Run Lane
    Alqonuin, Il. 60102
    USA
    (708)658-4941

CACHECHK performs memory access timing tests, to see if you have a
cache, how many caches, and to check the access speed.

SAMPLE COMMAND LINES:
To log information to a file, for posting or printing---
    >del mycpu.log
    >cachechk My 486, OPTI chipset >>mycpu.log
    >cachechk -w My 486, memory write >>mycpu.log
    >cachechk -f My 486, with cache disabled >>mycpu.log

There are two kinds of caches, the on-chip cache (level one, or L1)
which is in the CPU processor chip, and off-chip cache (level two, or
L2) which is on the motherboard.

Some 386 motherboards have 64KB of off-chip cache. Some CYRIX 386/486
chips have a very small L1 cache. Most 486's have an on-chip cache, and
most new motherboards also have an L2 cache.

    The Intel 486DX2/66 has 8KB of L1 cache.
    The Intel 486DX4/100 has 16KB of L1 cache.
    The AMD 486DX4/100 has 8KB of L1 cache.
    Intel Pentium has 16KB of L1 cache, 8KB for data and 8KB for code.

The typical 486 MB has 256KB of L2 cache, although many will let you
install 512KB or even 1MB.

Q: HOW MUCH CACHE DO I NEED?
For a 486, get a 256K cache.
All the writeups that I have seen say:
    64KB cache gets a LOT of improvement.
    128KB gets a bit more.
    256KB gets a teeny bit more.
    512KB gets only a teensy weensy bit more.
(Note: these are for DOS and WINDOWS3. A real OS, like LINUX, OS/2, and
perhaps WIN-95 and WIN-NT may be different.)
In general, this seems to be the way all caches work---the first bit
gets a lot of bang, and each additional bit gives smaller and smaller
improvements.
It is claimed that, as long as you have a decent L2 cache, an AMD
486/100 (with an 8K L1) is virtually identical in performance to an
Intel 486/100 (with a 16K L1).

At the time this was written......
256K often uses 8-(32K x 8) SRAM chips (8 * 32K = 256KB), in two banks,
and the MB can interleave accesses to the two banks.
512K often uses 4-(128K x 8) SRAM chips (4 * 128K = 512KB), but this is
only one bank, so the access time is slower, because the MB can't do
bank interleaving.
Whether the cache is in one bank or two depends upon details of the
motherboard design, and the chips available at the time. This is
expected to change in the future, as different SRAM configurations
enter the market.

Note the careful use of the word "can"; a particular MB might _not_ do
interleaving, even though it could. Also, depending on the exact MB
design, the interleaving may or may not be done properly (in the sense
of attaining the benefits).

Basically, 512K costs a LOT more than 256K and gives only a marginal
improvement in performance, so stick with 256K.

Q: Why does my main memory show faster with the cache enabled, even out
   beyond the cache size?
A: The L1 cache (on a 486) is filled in granularity of 16 bytes (this
   is the "cache line size"). When you read a byte, all 16 bytes of
   that line are read into the cache, in 4 quadbyte (32 bit) units. The
   quadbyte which is addressed is read from memory first, put into the
   cache, and transferred to the CPU. Then the other 3 quadbytes are
   read into the cache. So, if you are accessing the memory in
   sequential order (like CACHECHK does), the next bytes have been (or
   are being) automatically sucked into the cache, a "read-ahead", if
   you will. If the cache is disabled, this read-ahead does not take
   place.

Q: Why is my main RAM memory faster on writing than reading?
A: I don't know. If somebody can explain it to me, I'll put it in.

Q: How much are the timings dependent on sheer CPU speed, as opposed to
   the actual memory access speed?
A: Probably not very much. There is some loop control overhead in the
   timing loop, but it is minimal. Each different data point is
   executed for approximately the same length of time (1/2 second). So
   the 1KB block size loops 16 times as often as the 16KB block size,
   and its loop overhead is 16 times greater. But the times are
   constant (10 us/KB in my system). If the instruction overhead was
   significant, I would expect to see a "tilt" in the numbers. But I
   don't, so it's not.
   And remember, the goal of CACHECHK is to measure the detectable
   timing improvements of the caching, and not to measure the raw CPU
   power.

Q: How does it measure the CPU speed?
A: By measuring how long it takes to do a particular "bsr" instruction
   1000 times.

Q: No matter how I set the write cache policy (copy-back or
   write-through), CACHECHK reports the same speed. Isn't copy-back
   (AKA write-back) supposed to be better? What gives?
A: It *is* better, but only when reads and writes are intermixed. By
   design, CACHECHK does not intermix operations. It either reads in a
   sustained burst, or writes in a sustained burst, so the write policy
   never comes into play. CACHECHK is designed to saturate the memory
   read/write channel and to report the timings in this condition. In
   fact, in order to be able to detect and time the various stages of
   the memory system, it is absolutely necessary to throw as much cr*p
   at it as it can possibly handle.

===================================================================

CACHECHK will run the access tests using all the memory in your
machine, so that you can check to make sure that all the memory is
cached.

Usage: CACHECHK -tn -4hfnvwqz? [Optional comments]
Cache memory detector & timer. Runs only on 80386 (or better) CPU.
  -4  = Override CPU check, treat this as a 486.
  -h? = Print this help text.
  -f  = Perform tests with cache disabled.
  -q  = Quick. Faster but not as accurate.
  -qq = Each 'q' is 2 times quicker. But less accurate.
  -tn = Top of memory to test.
        n = nth MB
  -v  = Verbose.
  -vv = positively chatty!
  -xN = Extra tests. N is a bitmapped number. "-x7" does all three
        1 = --write.
        2 = --read with cache off.
        4 = --read with byte offsets.
  -w  = Do memory write (otherwise memory read).
  -z  = Slower. Like q, but the other direction. Probably won't be
        needed on anything less than a 786DX4/200.

The "optional comments" just get logged, so you can identify test
information along with the results. Mostly useful if you redirect the
output to a file. For example:
    cachechk -q Test with bios set to 1 wait state >before.dat

The basic timing loop is 1/2 second per size. Each 'q' cuts the time in
half. Each 'z' doubles the time. If the timing figures aren't steady,
you probably have too many q's.

You can run the tests with the cache disabled, with the 'f' option (on
a 486 or Pentium). Naturally, it is re-enabled again when it's done.
This generally turns off BOTH the L1 and L2 cache. Your bios setup may
(or may not) let you individually enable/disable the caches.

Times cache & memory access, and figures out cache size(s). Results are
in: microseconds per KB, MB per second, and nanoseconds per byte.
Memory is accessed in quadbytes, in flat 32-bit protected mode. For
base memory (first MB, MB#0) only 640KB is accessed. Memory accesses
are in various block sizes, from 1KB to 2MB. Each megabyte is tested
starting at the beginning of that megabyte.

CACHECHK will work under a memory manager (HIMEM, EMM386, QEMM,
Windows, etc.), but the results may be inaccurate, the machine might
crash, and it won't be able to test all of the memory. It will run
under WINDOWS, but results are wildly inaccurate. For best results,
boot clean--on DOS 6 & above, hit F5 while it boots.

It will NOT touch extended memory that is already allocated or in use.
If you have a memory manager installed, it usually occupies the first
portion of the 2nd megabyte, so CACHECHK will not be able to check
that.
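
The option arithmetic described above can be sketched in a few lines of
Python. This is only an illustration, not CACHECHK's own code: the
names loop_seconds and decode_extra_tests are made up for the example,
and it assumes that a 'q' and a 'z' given together simply cancel out,
which the text above does not state.

```python
# Hypothetical helpers; these names do not come from CACHECHK itself.

BASE_SECONDS = 0.5  # the basic timing loop is 1/2 second per block size

# Bit values for -xN, as listed in the help text.
EXTRA_TESTS = {
    1: "write",
    2: "read with cache off",
    4: "read with byte offsets",
}


def loop_seconds(q_count=0, z_count=0):
    """Nominal time per block size: each 'q' halves it, each 'z' doubles it.

    Assumption: a 'q' and a 'z' given together cancel out.
    """
    return BASE_SECONDS * 2.0 ** (z_count - q_count)


def decode_extra_tests(n):
    """Return the names of the extra tests selected by the -xN bitmap."""
    return [name for bit, name in EXTRA_TESTS.items() if n & bit]


if __name__ == "__main__":
    print(loop_seconds(q_count=2))  # as for -qq
    print(loop_seconds(z_count=1))  # as for -z
    print(decode_extra_tests(7))    # as for -x7
    print(decode_extra_tests(5))    # as for -x5
```

With two q's the nominal loop drops to 1/8 second per size, which is
why the figures get less steady as q's are added.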
SOME TIMINGS I HAVE TAKEN

CPU       L1    L1speed  L2    L2speed  Mem      Speeds      (taken from the
type      size  ns/byte  size  ns/byte  ns/byte  µsec/KB      printout)
------    ----  -------  ----  -------  -------  ----------
386/25SL  0     n/a      64k   80.2     108      84.....114  (Laptop)
386/25    0     n/a      64k   59.2     90       62......94  (Desktop)
486/33    8k    30.7     128k  43.6     70       31..45..73  (Intel)
486/66    8k    16.1     0     n/a      50       15......52  (Intel)
486/100   8k    11.1     0     n/a      46       10......48  (AMD)
486/100   16k   10.0     256k  18.8     26       10..19..27  (Intel)
P-75      8k    10.2     256k  16.4     24       10..17..24

Timer
-----
CACHECHK directly accesses the timer chip to get a high-precision timer
(0.838 microsec resolution). In some motherboards (notably reported to
be "UMC with fake cache chips"), there is a problem with this timer. I
worked around this in version 2, but there may be some boards where my
work-around still doesn't work.

Cache basics
------------
The CPU cache is generally organized in "lines" of 16 bytes. An access
to any byte which is not in the cache causes a "cache line fill"
operation, which reads in the entire 16 byte chunk. On a 486, memory is
4 bytes wide (32 bits). Using burst-mode, the CPU reads four 32-bit
chunks. The numbers you sometimes see, such as 3-2-2-2 or 2-1-1-1,
refer to the number of clock cycles to read these chunks. The first one
takes longer because of setup time.

The 486 CPU can do "out-of-order line fill". If you access the 4th
chunk, the memory system will read that chunk in first, pass the
addressed bytes to the CPU, then read in the other 3 chunks.

The L1 cache is on the CPU itself (AKA on-chip cache), and is usually
4-way set associative. This means that a line can go into any one of
four different places in the cache, using an LRU (least-recently-used)
algorithm.

The L2 cache is on the motherboard. Almost always these are
direct-mapped (or "1-way"). This means that a line can go into just one
place in the cache. Four-way set associative is better. Actually, the
more "ways", the better. Also, more expensive.
"Fully associative" is the absolute best. The Cyrix 686 has a small (256 byte) fully-associative instruction-line cache (for instructions only), and a 16KB 4-way set associative cache. The Pentium has 8KB for instructions and 8KB for data, totalling 16K. The other terms you hear are "write-back" (AKA "copy-back"), and "write-through". In write-through, when the CPU changes a byte in memory, it first writes it into the cache, then that cache line is written to the RAM. In write-back, when the CPU changes a byte in memory, it writes it into the cache, and sets the "dirty-bit" for that cache line. Later on, when some other piece of RAM needs to be put in the cache, the memory system will notice that the line is "dirty", and write it to RAM, before reading the new data into the cache line. Write-back is better. Most motherboards have write-back L2 cache. The Cyrix CPUs, and the "enhanced" AMD and Intel CPUs have write-back L1 cache. Other AMD and Intel cpus have write-through cache. Very advanced CPUs and motherboards do other tricks in the write-back stage. Instead of writing the cache line to RAM, then reading the new data into the cache, it will move the data from the cache into a write-buffer, then do the read, then finally write into RAM when there is no read activity. More of these write buffers will give the memory system more chances to delay a write cycle in favor of a read cycle. Again, depending on the exact MB design, the write-back may or may not be done properly (in the sense of attaining the benefits). Motherboard Specsmanship ------------------------ Back in the early days of transistor radios, one selling point was how many transistors a radio had. Some manufacturers (especially Korean and Japanese) would incorporate non-functional transistors, just to be able to be able to (legitimately) claim a higher count. The same thing is possible with MB designs. 
A "write-back" design without a dirty-tag-ram, or without posted write buffers, will not perform much (or any!) better than a write-through. Ditto for interleaving. Further Reading --------------- I have reluctantly provided more technical detail above that I had originally planned. However, it's by no means a complete discussion. Refer to the pc hardware FAQ (PCHWFAQ, filename PCFAQnnn.ZIP) for much more detailed information. Interpreting the printout ------------------------- CACHECHK v3beta2 11/11/95 Copyright (c) 1995 by Ray Van Tassle. (-h for help) ****** WARNING ******* CPU is in V86 mode! Timings may not be accurate! CMOS reports: conv_mem= 640K, ext_mem= 15,360K, Total RAM= 16,000K ### This is the memory size, as listed in the CMOS (via setup) "GenuineIntel" 486DX4 100 MHz ### The cpu type & speed. Advanced CPUs will identify themselves. For ### others, the type is derived. ### The speed is determined by instruction timing, and is fairly accurate ### (plus/minus 3 mhz) from 386/16 to P5/120. Reading from memory. MegaByte#: --------- Memory Access Block sizes (KB)----- 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 <-- KB 0: 11 11 11 11 11 20 20 20 20 28 -- -- -- ęs/KB ### "n:" is the megabyte number. 0 = base memory, 1 = 2nd MB, etc. ### The numbers are how many microseconds it takes to read/write a certain ### number of bytes. ### We can see that 1KB through 16KB takes 11us/KB, 32KB through 256KB takes ### 20us/KB, and 512 KB takes 28us/KB. This is the base memory, so we can't ### go beyond 640KB. So we stop at 512KB. ### The "--" (for blocks of 1MB, 2MB, and 4MB) means that blocks of this size ### were not tested (because they could not be). ### I go up to 4MB, because some lucky folks have 1MB of cache, and if I ### stopped at a 2MB block, there is only one data point of RAM speed. 2: 11 11 11 11 11 20 20 20 20 28 28 28 28 ęs/KB ### Megabyte #1 is skipped here. Because I have a memory manager (QEMM) ### loaded. 
### It occupies some of the memory at the beginning of MB#1, therefore
### this MB cannot be allocated, therefore I don't test it.
### However, I can check a blocksize of 4MB. This means that it reads from
### address 0x0200000 through 0x05fffff.
### We now have 4 data-points of actual RAM speed.
### By inspection, we see that there are two breakpoints in the memory
### access speed. The first at 16KB, the second at 256KB. This is as it
### should be, as this is an Intel 486/100 (with 16KB of L1 cache), and 256KB
### of L2 cache on the motherboard. An AMD 486/100 has 8KB of L1 cache.
### A Pentium has 8KB of data cache and 8KB of instruction cache.
### CACHECHK only tests data, so P5's will show 8KB of L1 cache.
  3 4 5 6 7 8 9  <--- same as above.
### The speeds of megabyte 2 thru megabyte 9 are all the same.
### Actually, this is a small fib. Since mb#9 is the last full mb being
### tested, we clearly can't be using block sizes of 2MB or 4MB. And for
### mb#8, we can't use block size of 4MB. Those entries actually have
### "--". But the numbers which *are* there are the same.
This machine seems to have both L1 and L2 cache. [read]
### This is reading. Writing will say "[write]".
L1 cache is  16KB -- 103.3 MB/s 10.2 ns/byte (262%) (186%)  3.9 clks
L2 cache is 256KB --  55.3 MB/s 19.0 ns/byte (140%) (100%)  7.2 clks
Main memory speed --  39.3 MB/s 26.7 ns/byte (100%) [read] 10.2 clks

L1 cache is  16KB -- 102.8 MB/s 10.2 ns/byte (261%) (185%)
L2 cache is 256KB --  55.3 MB/s 19.0 ns/byte (140%) (100%)
Main memory speed --  39.3 MB/s 26.7 ns/byte (100%) [read]
### The L1 cache runs at 262% of RAM speed (over two and a half times as
### fast).
### The L2 cache runs at 140% of RAM speed (almost one and a half times as
### fast).
### The L1 cache runs at 186% of L2 speed (almost twice as fast).
### It takes an average of 3.9 clock cycles to read a 32-bit longword from
### the L1 cache into a 32-bit register, 7.2 cycles from L2, and 10.2 from RAM.
Effective read RAM access time is 106ns (a RAM bank is 4 bytes wide).
### This is the *measured* access speed of the RAM. On a 486, this
### is 4 times the "main memory speed" (above example is 4 * 26.7), because
### a 486 has a 32-bit (4 byte) path between RAM and the CPU. For a P5,
### this will be 8 times, because the P5 has a 64-bit (8 byte) path.
### Note that this is NOT the speed of your RAM chips---it is the measure of
### how fast your RAM is being driven.
###
### This is VERY dependent on the "DRAM Burst Cycle" settings in your BIOS
### setup. This computer has an AMI bios, where I can set it to certain
### values. The fastest it will allow is 3-2-2-2. The slowest
### it will allow is 5-4-4-4. With this slow setting, I get:
###   Main memory speed -- 24.9 MB/s 42.1 ns/byte (100%) [read]
###   Effective read RAM access time is 168ns (a RAM bank is 4 bytes wide).
### This is much slower, BUT it might let me use slower (i.e., cheaper) RAM
### chips.
### Note that *many* motherboards will NOT let you fiddle with these settings.
### The 486 (and Pentium) always fill the entire cache line (16 bytes on a
### 486, 32 bytes on a Pentium), so it is NOT possible to get just one byte
### from the memory; it always grabs the whole line.
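
The reading of the printout walked through above can be sketched in
Python: find the block sizes where the time per KB jumps (the likely
cache sizes), and multiply the ns/byte figure by the width of the RAM
path to get the effective access time. This is a rough illustration and
not CACHECHK's actual detection logic. The function names and the 1.2
jump threshold are assumptions made for the example; the numbers are
the megabyte #2 row and the main memory figures from the sample
printout.

```python
# Hypothetical sketch; not CACHECHK's actual detection logic.

BLOCK_KB = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]
# Row for megabyte #2 from the sample printout, in microseconds per KB.
ROW_MB2 = [11, 11, 11, 11, 11, 20, 20, 20, 20, 28, 28, 28, 28]


def find_breakpoints(block_kb, us_per_kb, threshold=1.2):
    """Return the block sizes (KB) after which the time per KB jumps.

    A jump means the next value is at least `threshold` times the
    current one. The 1.2 threshold is an arbitrary choice for this
    example.
    """
    breaks = []
    for i in range(len(us_per_kb) - 1):
        if us_per_kb[i + 1] >= us_per_kb[i] * threshold:
            breaks.append(block_kb[i])
    return breaks


def effective_access_ns(ns_per_byte, bus_bytes):
    """Effective RAM access time: ns/byte times the RAM path width in bytes."""
    return ns_per_byte * bus_bytes


if __name__ == "__main__":
    print(find_breakpoints(BLOCK_KB, ROW_MB2))  # candidate L1 and L2 sizes
    print(effective_access_ns(26.7, 4))         # 486, fastest burst setting
    print(effective_access_ns(42.1, 4))         # 486, 5-4-4-4 setting
```

For the sample row the breakpoints come out at 16 and 256, matching the
16KB L1 and 256KB L2 discussed above. The two products are about 106.8
and 168.4, which the printout shows as 106ns and 168ns. For a P5 the
second argument would be 8 instead of 4.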