Quick Sort 32 v2.13 Compiled 6 July 1997 32bit protected mode computer speed test program (C) Copyright 1997 by Sami Farin All Rights Reserved mailto:sfarin@ratol.fi http://www.ratol.fi/~sfarin/ Index 1. What's this? 2. Requirements 3. Usage 4. About Qsort 5. Command line arguments 6. Technical (?) stuff 7. Random data asm src 8. Errorlevels 9. Final words 10. More words [1.] - What's this? Quick Sort 32 v2.13 (shortly Qsort) is a computer speed test program. It runs in 32bit protected mode (using DOS32 v3.3 by Adam Seychell). Qsort is programmed in assembly and optimized for Pentium class machines. [2.] - Requirements Depending on the test size, Qsort needs as much (real/virtual) memory as it takes to store the test data. You can test with sizes of 4 bytes (with 'F') - 4 gigabytes. The random numbers are the same every time you run the program (surprise!). Qsort also requires 386+. If FPU is available, it will be used to calculate the speed, since overflows don't occur so easily as with 64bit integers. When FPU is used, max. displayed speed is 42949672.95 kB/s (20000x more than Pentium Pro 180) and without FPU about 42000 kB/s. Minimum for both is about 0.00 kB/s. If the test runs for more than 2^32 microseconds, the kB/s results will not be correct if you don't have a FPU or don't use 'T'. If you have a FPU, elapsed time will be displayed incorrectly if the test runs for longer than 2^32 microseconds (~36 minutes), but the kB/s rate is calculated correctly, though. If using TSC, max. elapsed time is 2^64 CPU core clock cycles (2500-10000 years) and kB/s rate will be displayed correctly. Qsort (actually DOS32) supports DPMI v0.9, VCPI v1.0, XMS v2.0 and v3.0, Int 15h and RAW mode (clear boot). Win95 is also ok, but please note that GUIs etc take >5 MB of memory. It means that Win95 starts swapping like crazy when qsorting with >11MB on a 16 MB machine. And THEN it IS slow! Try closing every program and removing the wall paper if your disk crunches a lot during the test. You can use Qsort to test Win95 swapping speed and hd speed :) Best result is achieved with clear boot (F5 at bootup for MSDOS 6+). Qsort uses timing interrupt of it's own, so TSRs won't slow it down, because they don't get called at 18.2 Hz rate as before. So you should get the same results even when using different DOS and/or BIOS (on the same computer, of course). DOS32 extender takes about 8.5 kB and the Qsort itself about 6 kB unpacked (thank the gods I'm not a VB-programmer). Some debuggers (SoftICE for DOS, TD, GameTools...) don't work with Qsort (actually DOS32). You have to unload them before using Qsort. Sorry. No fix. [3.] - Usage Qsort can be run in many ways, the easiest being "qsort32 [ENTER]". If you don't have a keyboard, try 2xCLICK. If you don't have enough memory, Qsort quits whining. If you've got lots of mem, the test will begin. First it parses the command line (see [5.]). Then Smartdrive and Win95 buffers are flushed and the test data will be generated (you will see 'i'). If you have used command line option 'D', the random numbers will be saved to file "Unsorted.$$$" before flushing. You will get a bit lower results if you use 'D'. 1st and 2nd level caches are not flushed before sorting. 'w' is displayed if you used option 'W'. Since v2.0, Qsort captures timer interrupt to achieve more stabile and precise timings. On my computer, DOS used about 100 microseconds in every timer interrupt (ints happen about 18.2 times per second). And that 100 us is not the same every time. So I decided to set up a timer interrupt of my own. DOS time is not updated while Qsort is sorting the data. Well, then comes the actual test; timer is started and the data sorted. After the sort timer is stopped. Timing accuracy is about 1/(4.77M/4)s ~839ns or 4-17ns with TSC. Please note that the elapsed time is rounded. It is stored more accurately on the FPU (e.g. 345987ns "inside" the FPU, but it's displayed as 346us). But what the heck. Then the sorted data is checked (to make sure it actually IS sorted) and 'c' will be displayed. It the data isn't ok, you will see '!' (but you shouldn't see it, EVER). If you are overclocking and see '!', your computer isn't working correctly... Here some other results you might expect: *=256 kB L2 PB-cache **=512 kB L2 PB-cache Machine Speed (kB/s) with 12000 / 6000 / 1000 kB **AMD K6 208.333 MHz 3053.1 / 3384.1 / 4008.4 Mem timings tweaked a bit **AMD K6 208.333 MHz 2941.1 / 3286.9 / 3967.4 83.333*2.5 MHz **AMD K6 200 MHz 2822.4 / 3132.2 / 3767.4 66.666*3 MHz **AMD K6 200 MHz ------ / 2925.5 / ------ 66.666*3 MHz Pentium II 266 MHz ------ / 3169.5 / ------ 512 kB cache @ 133 MHz Pentium Pro 180 MHz ------ / 2253.1 / 2534.1 Internal 256 kB L2 cache *Pentium 166 MHz 2274.7 / 2542.0 / 3037.7 66.666*2.5 **Cyrix 6x86-P166+ 1985.6 / 2209.6 / 2599.1 ASUS P/I P55T2P4, 64 MB EDO *Pentium 120 MHz 1738.6 / 1927.7 / 2276.7 60*2 MHz *Pentium 120 MHz ------ / 1905.8 / 2259.5 60*2 MHz Cyrix 6x86-P166+ 1580.0 / 1760.0 2088.1 133 MHz CPU, no L2 cache *Cyrix 6x86-P120+ 1459.4 / 1627.5 / 1904.5 100 MHz CPU, 16 MB FPM *Pentium 90 MHz ------ / 1322.1 / 1563.1 486DX 33 MHz ------ / 275.0 / 323.2 8+64 kB cache 386SX 20 MHz ------ / ------ / 45.1 only 4MB mem :) Pentium Pro/II are so "slow" because they can't execute compare and conditional jump in parallel. And I couldn't work out a "fix" for that. PPro's 2nd level cache is 2-4 times faster than any kind of PB-cache, so PPro would beat other Pentiums it if could execute those damn instuctions in parallel! If you can test this program with different memory types (FPMRAM, EDORAM, SDRAM) on the SAME motherboard AND CPU I'd be pleased to know how memory type affects the results. Also 256 kB vs 512 kB might be interesting. And bus speed (33,50,60,66,75,83,100 MHz). Test results for Cyrix 6x86MX and Pentium MMX wanted! Please note that this program is optimized for Pentium class machines (including Cyrix 6x86). It means that 486-optimized version of Qsort would run slower on Pentiums than Pentium-optimized code would and so on. If you used command line option 'D', the sorted data is written to file "Sorted.$$$". It looks very very interesting, check it out if you don't have anything else to do! Be sure to have enough free space on current working directory/disk. After the test has finished, you may do whatever comes in your mind. For example, run it again to see if some nasty interrupts interrupted Qsort so lot that it affected the result. It's always a good idea to run the program at least twice. [4.] - About Qsort I compared Qsort to a bubble sort routine and found out that quick sort algorithm was about 4500 times faster with 1500 integers. And bubble sort slows down exponentially. If you want the source code for Qsort (well commented and so on, of course), please ask me for more info. The executable can be used to any purpose. BUT it (and every file in this package) may NOT be modified in any way. You may NOT reverse- engineer, disassemble or do something as silly to the executable. You may not ask money for Qsort32, since it's FREEWARE! TASM 5, Qedit, SoftICE for Win95 and some other progs were used in development of Qsort. Also lots of optimizing and some thinking happened. If you have an ANSI C-language quick sorting routine that is faster than Qsort, it would be nice to hear about such a beast (and the compiler). It must sort completely random 32bit unsigned integers (as in Unsorted.$$$, see [7.]). [5.] - Command line arguments *** No need for '/' nor '-'. The options are case insensitive. *** Q: Quiet, output only the speed in kB/s R: Reverse, sort reverse order numbers instead of random ones 9,8,7,6,5,... instead of 7,1,4,6,2,... F: File, get numbers from file 'TESTFILE.Q32' Qsort sorts them as 32bit integers (you should know that by now...). ,: Use ',' as the delimiter instead of '.' Outputs '1234,56' instead of '1234.56'. Excel might not accept both of them. L: Add line feed when using option 'Q' Outputs empty line after sort speed when option 'Q' used. x: Integer, 1-[big number], test size is in kilobytes (1024 bytes) '23000' tells Qsort that you want to test with 23000*1024 bytes. M: Megabyte, test size is in megabytes (1048576 bytes) Option "23m" means 23*1048576 bytes. V: On-The-Screen(TM) using 1024x768x256 linear framebuffer mode If you want to see HOW quick sorting algorithm works, you might want to check this out! You must have VBE v2.0+ to use this. UniVBE or S3VBE20 might help if you don't have VBE v2.0+ already on board. Note again that the pixels are sorted as 32bit integers. You will get poor results since reading from video memory is REALLY slow. This option not intended to measure video performance. P: When using 'V' and 'F', get palette from the file (HSI RAW file format) Make the file with makeraw1 (256c B&W) or makeraw2 (256c colour). If 'P' not specified, also the possible palette+header will be sorted. Note: if using 'P' and 'V' without 'F', default palette will be used. Image Alchemy is available at ftp://ftp.handmadesw.com. You can also use any other 256c B&W format, but in that case, don't specify 'P'. You also have to strip the header from a non-HSI RAW file. Z: No delays when using option 'V' If you just can't wait. T: Use RDTSC instruction, found on Intel Pentium+, AMD K5+ and most new CPUs. Timing accuracy of TSC (Time Stamp Counter) is very good, 10ns on a 100 MHz Pentium. And that clock is quite stabile too, +-250 picoseconds (max). You have to specify your CPU's speed in MHz before you can use option 'T', because if the MHz-rate is calculated every time you run Qsort, it will be (a bit) different every time. Also, when multitasking, calculated MHz would be quite inaccurate. The speed must be specified in environment variable 'QSORMHZ' or by specifying it after 'T' ('T:180', 'T=180' or with decimals 'T:166.666', 'T=233.333'). Max number of decimals is three, the rest are ignored. If CR4.TSD=1 and CPL!=0, this option won't work (GPF). Non-braindead operating systems let that bit be 0, so 'T' works OK. Cyrix 6x86 doesn't have TSC! Oh, it's so sad. C: Calculates amount of Memory IO (external memory) of total execution time. It means the count of clock cycles during which the processor's external memory bus was in use divided by elapsed clock cycles * 100%. The larger the amount of Memory IO of the total execution time is, the slower your memory subsystem (2nd level cache and main memory) is. Use 'C' on Pentium and Pentium MMX CPUs. This options also measures how many million machine languge instructions were executed in a second (MIPS) during the test. Absolute maximum value for a normal Pentium is MHz-rate*2, but there are wait states during memory accesses and every instruction can't be executed in parallel, some instructions take more than one clock cycle and such nice little things. "Average number of instructions to sort one integer" is the same for every computer (so it is there just for fun). If the value is not stable with the same command line options, it's due to hardware interrupts (timer, keyb, printer, modem...). This options doesn't need option 'T', but it won't be of any harm. *** This option needs CPL 0, so it won't work under Windows etc. *** *** (RDMSR and WRMSR are privileged instructions...) *** X: Same as 'C', but for Pentium Pro and Pentium II. Use this if 'C' doesn't (seem to) work in CPL 0. I'm not quite sure if the Memory IO value is correct, since Pentium Pro has different counters than Pentium. Qsort uses event 62h to calculate the Memory IO, but it doesn't include 2nd level cache accesses (62h="External Bus Cycles - DRDY Asserted"). As you can see, MIPS value is quite low on a PPro compared to MHz-rate and Pentiums, since PPro can't execute compare+conditional jump in parallel ;( As you know, MSRs needed for options 'C' and 'X' exist only in Intel-powered CPUs (Pentium and up). (At this time, I mean!) D: Debug, dump unsorted numbers to 'UNSORTED.$$$' and sorted to 'SORTED.$$$' ...so you can see that Qsort actually sorts them. W: Wait about 1s before starting the sort. ?: No idea. [6.] - Technical (?) stuff There's an option to specify the test size so that even with 80886 2 GHz + 2 GB mem the result is accurate and meaningful. Don't press "Any Key" during the test, since one keypress might take as much as 1ms to process. Interrupt occurs also when releasing the key. Keyboard buffer is not flushed before the test starts. If you want to be sure that keyboard interrupts do not occur while sorting, use option 'W' and DO NOT dance on the keyboard! If you specify a size which is smaller than 2nd level cache size, speed of main memory won't affect the result at all. See included .XLS files for more info! Them are made with Excel 7. I think it uses Excel 5 format. If you've got old Excel or no Excel at all, SORRY. I didn't release the charts in Excel 4 format, because it was two times larger than in Excel 7 format and the colours got screwed and windows misplaced. You can replace the values with values made by makekbs{1,2}.btm (4DOS needed!). Just open resu*.kbs in Excel and copy/paste them to rndtest.xls and/or revtest.xls. Then change some settings, reboot the machine, rerun .btm files ... and play with Excel to see how the change affected sorting speed with different memory block sizes. You can also make a chart of your own with the .BTMs and Excel or some other program made for that purpose. Qsort does no heavy calculations, so you have to use (FOR EXAMPLE) Safbench to test CPU, FPU and memory speeds thoroughly. Following things affect the result: 1) CPU type (386,486,Pentium,AMD K6,Cyrix 6x86MX...) 2) CPU speed (16,33,66,100,133,180,200,233 MHz...) 3) Bus speed (33,50,60,66,75,83,100 MHz...) 4) Main memory type and speed (FPMRAM,SDRAM,BEDORAM,20,55,70ns...) 5) 1st level cache size, type and speed (8,16,32 kB...(*)) 6) 2nd level cache size, type and speed (64,128,256,512,internal,SRAM) 7) Chipset settings in BIOS (Wait states, RAM timings...) 8) Motherboard design and used chipset 9) Enabled features of the processor (Write gathering, wback,...) (*) Only size of the DATA cache affects the result if you have 1st level cache in the first place. Pentium has 8 kB for data, Pentium MMX 16 kB and AMD K6 has 32 kB. For example, many 'mips-programs' don't use 2nd level cache nor main memory in the stupid little 8086/8087-compatible ten-instuction loop. So all that matters to such programs is CPU type and speed. Did you know: if there are two Pentium 100's, the only thing what makes the speed differences in CPU-related tasks is memory speeds and cache sizes. 1st and 2nd level cache and main memory read AND write speeds are in relationship with the result given by a test program. Every test program uses memory blocks of different sizes, so Pentium A might be faster with a program which uses under 256 kB of mem and Pentium B might be faster with huge programs or programs which access lots of data. Also the type of memory access (random, sequential) makes differences. Try 'Safbench M' to see how fast memory you REALLY have. For example, CacheChk uses slow 'rep lodsd' instuction to read data, which reads one 32bit word per cycle _AT MAX_. Pentiums are capable of reading two of 'em in one cycle. If you are overclocking and you occasionally see '!', it means that your computer doesn't work correctly and you should stop overclocking RIGHT NOW since your data is in danger. You might want to add some wait states for memory (in BIOS setup) and try again if it helped. If you keep getting '!' characters, you should put the jumpers into their original places (unoverclock). If that doesn't help, something is REALLY wrong. [7.] - Random data asm src Here's the source code to produce the random integers used in the test. Cld Mov EAX, 2156620195 Mov EBP, 1582790145 Mov ESI, NumberOf32bitInts Mov EDI, [MemBuffer] ; EDI points to the start of the buffer RandLoop: Ror EAX, 1 Xor EAX, 05AC5C6B3h Mul EBP Rol EBP, 1 Add EBP, 0CA51AB67h Xor EAX, EBP Ror EDX, 3 Add EDX, EAX Add EBP, EDX Rol EBP, 7 Mul EAX Xor EBP, EAX Xor EDX, 01A95E145h Rol EDX, 1 Xor EBP, EDX Xor EAX, EBP StosD Dec ESI Jnz RandLoop Cli Jz $ ; Or something just as genius [8.] - Errorlevels 0 OK 1 Compare failed, integers not in order If this happens, your memory is defective, you have too "fast" settings for memory in BIOS or you are overclocking and your computer doesn't work correcly because of it. If you have overclocked your computer and comparison fails, you should NOT overclock your computer at all or add some wait states etc. 255 Not enough memory 253 TESTFILE.Q32 not found, put it in current dir if you want to use it 252 Specified conflicting command line options, RTFM 251 Help screen displayed 250 Could not get mode info for mode 1024x768x256 249 Linear framebuffer mode 1024x768x256 not supported Try installing UniVBE or S3VBE20 or something 248 Failed to map framebuffer (tell me if you see this) 247 TESTFILE.Q32 too small (<4 bytes) 246 Too big test size specified (>4GB) 245 CPUID not supported Qsort doesn't try to enable that for Cyrix 6x86, since it doesn't have TSC. 244 TSC not supported So you can't use option 'T' 243 No MHz-rate specified It must be in env. variable QSORMHZ or use option 'T:xxx'. 242 Specified MHz was not between 1-1000000000. 241 Pentium compatible MSRs not available So you can't use options 'C', 'X' and 'T' 240 CPL isn't 0 (needed for option 'C' and 'X'), try clear boot Under Win etc Current Privilege Level is 3 [9.] - Final words If you have some bugs or improvements to report feel free to tell me. My email-address is sfarin@ratol.fi and the "nice" homepage is located at http://www.ratol.fi/~sfarin/ ... The most recent Qsort32 is there. WANTED: working exception handlers and register dumpers for Tran's Pmode (v3.07)... REMEMBER: if everything else fails, TRY CLEAR BOOT! [10.] - More words DOS32 is (C) Adam Seychell, Win95 is (C) Microsoft, TASM is (C) Borland, Qedit is (C) Semware, 4DOS is (C) JP Software, Image Alchemy is (C) Handmade Software, Inc., Pentium is (C) Intel etc... You use Qsort and Safbench at your own risk. If you lose 4 MB document because Win boots your computer or something it's your and only your problem. Qsort worked for me, I hope the same for you.