From: e9126050@fbma.tuwien.ac.at (Peter Kofler) Newsgroups: alt.lang.asm Subject: 80486 optimizations Date: 14 Dec 1995 11:13:36 GMT Message-ID: <4ap0t0$3rl@news.tuwien.ac.at> Organization: Vienna University of Technology, Austria Summary: Optimizing techniques for i80486 Lines: 353 Hello! I have just collected some information about optimizing on 80486, and because there are always people interested in this, I posted my summary of various documents and techniques: Optimizing your code for speed on i486 ---------------------------------------- > SUMMARY: This text is a summary of various documents about optimizing your code for speed, especially on 80486. For the original sources look at the references at the end of this file. This document does NOT wand to replace the other docs, but gives a short overview about optimizing on i486. > ALGORITHMIC OPTIMIZATIONS: Optimizations which apply to any processor: - Do not calculate values more than once, use pregenerated tables - Put invariant code outside loops - Perform Loop unrolling Example for clearing a buffer at [ds:di]: mov BX,2 mov CL,4 ; divide the count of words to clear by shr SI,CL ; 16, because we'll clear 16 words ; each time through the loop StoreLoop: REPT 16 ; clear 16 words in a row without looping mov [DI],AX ; clear the current word add DI,BX ; point to the next word ENDM dec SI ; count off blocks of 16 words to clear jnz StoreLoop ; until none remain Loop unrolling is even better, if the loopcound does not vary, because conditional jump overhead (3 (core) cycles because we branch) is removed. - Do not use Floating-Point Math, use integers or fixed Reals instead. - Use assembler for time-critical code, because no compiler is able to optimize like the human programmer (in assembler you are even able to overlap integer and Floating-Point operations.). > USE THE REGISTERS, AVOID ACCESSING MEMORY OPERANDS: - Keep memory accesses to a minimum by avoiding memory operands. - Use the (E)AX as much as possible. Many instructions are shorter when using (E)AX. That's less code you and slightly less internal processing. - Use the DS register as much as possible for the same reason. Other segents need an additional clock (segment prefix). - Try to use the ESP register to reference the stack in lower levels of subroutines (faster than push/pop things and leaves EBP free for other uses) > REGISTER ACCESS CPU STALLS: for 1 clock occur when modifying the lower word of a DWORD register that's used immediately after it. Example: mov AL,0 mov [EBP], EAX ; a one clock penalty > ADDRESSING: AVOID DISPLACEMENT AND INDEX, USE BASE: - In the case of several references being made to a variable addressed with a displacement, load the displacement into a register and use it as base. - Avoid displacements, they cause AGIs (comes later) and need more time. - Pull 32 bit address calculations into load and store instructions. Memory reference instructions have 4 operands: Base[Reg+Reg*INDEX+DISPLACEMENT] Often several integer instructions can be eliminated by fully using the operands of memory addresses. - But avoid using registers as an index in 32 bit address calculations (Reg*INDEX) because this needs one extra clock cycle, if you use other Reg than BX and BP (is this true ?) > ADDRESS GENERATION STALLS (AGI): An Address Generation Interlock occurs when a register which is currently being used as the base or an index was the destination component of a previous instruction. AGI's can only occur on adjacent instructions. Example: add SI, 4 // AGI, ONE CLOCK STALL mov DX, [SI] Even instructions that read or write implicitly to registers cause AGI's. Example: mov ESP, EBP // AGI, ONE CLOCK STALL ; pop uses the ESP as a pointer pop EBP > AVOID SEGMENT REGISTERS AND PREFIX CODES: - It takes an ungodly long time to load a segment register, so load segment registers as infrequently as possible (3-9 cylces instead of 1). - Minimize the use of far pointers as dearly much as you can. If you are doing a lot of processing on a small piece of data far away, it might be faster just to copy the data to nearby and work on it there. - Avoid prefix bytes, they increase instruction size and cost at least one extra cycle (0Fh is also a prefix for the 2nd opcode map). > AVOID CONSTANTS: (see also AVOID DISPLACEMENTS) - Registers should be cleared with "xor Reg, Reg", which is fast but sets up conditions codes. A slightly slower way to do it of course is to "mov Reg,0" which preserves condition codes. - Keep constants and displacements out of critical loops. For example, you should replace: ADDLOOP: MOV DWORD PTR BaseTable[EDX+EBX],0 ADD EBX,4 DEC ECX JNZ ADDLOOP with: LEA EBX,BaseTable[EDX+EBX] SUB EAX,EAX ADDLOOP: MOV [EBX],EAX ADD EBX,4 DEC ECX JNZ ADDLOOP > INSTRUCTION DECODING STALLS: An instruction with an IMMEDIATE VALUE AND an IMMEDIATE DISPLACEMENT has a one clock penalty. Example: mov spam, 248 mov dword ptr [ESP+4], 1 > USE 32 bit ACCESS: Use 32 bit instructions and if you can, align data on 32 bit. But do not use it, if you do not need because 32 bit instructions in 16 bit mode need at least one extra cycle (and vice versa) due to operand size prefix 66h. Let's say you have and array of unsigned words and need to multiply them for a constant, say 45, how can you do that? If you know the values won't overflow the 16 bit they use, you can do the following: mov ECX,count mov ESI,start_address handle_two_words: mov EAX,[ESI] ; load two words at a time ; an AGI happens, but can't eliminate it lea EAX,[EAX+EAX*4] ; multiply by 5 add ESI,4 ; increment pointer, avoid AGI lea EAX,[EAX+EAX*8] ; then multiply by 9 dec ECX ; decrement counter mov [ESI],EAX jne handle_two_words > USE SIMPLE INSTRUCTIONS, THE "LOAD/STORE STYLE": - Most oddball instructions, from AAA to XCHG are much slower than the core instructions(ADC, ADD, AND, CALL, CMP, DEC, INC, Jcc, JMP, LEA, MOV, OR, POP, PUSH, RET, SBB, SUB, TEST, and XOR). - Avoid using complex instructions like LEAVE, ENTER, LOOP, string instructions etc. Simpler instructions will get the job done faster. So use the "Load/Store style" code generation. - Use "dec CX; jnz" (twice as fast) instead of "loop looptop" "push EBP; mov EBP,ESP; sub ESP,BYTE_COUNT" instead of "enter" "add AX,AX" instead of "shl AX,1" "mov Reg,Mem, push Reg" instead of "push Mem" "mov" 3 times instead of "xchg" (3 cycles instead of 5) > DON'T USE STRING INSTRUCTIONS OTHER THAN REPed: - String instructions are so slow that you should not use them other than the always superior REP MOV's. ("LODSB" takes 5 cycles but "MOV AL,[SI]; INC SI" take only 2 cycles) String instuctions need nearly twice as much time as short instructions. - If possible use REP Stringop. REP needs nearly no time for each loop: "rep lodsw" ................. 70, "mov; dec CX; jnz" .......... 95, "mov (16x); dec CX; jnz;" .. 75, "mov (256x); dec CX; jnz;" .. 73, > DON'T BRANCH: - If you have to use conditional jumps, the more often used code should follow if the branch is NOT taken. - "JCXZ" takes 2 times longer than "JZ", so avoid it. - Avoid Near Jcc's because they take 1 cycle longer (prefix 0f). > JMP SHORT, CALL NEAR: - Jump instruction come in three forms, short (rel8), near (rel16/32) and far. The short form is quite a bit faster. To use short jumps (when you know it is possible), use "jmp short" instead of plain "jmp". - If you have far procedures which are called from within the same CS, use "push CS; call near proc" instead of "call proc", which is faster. - Destination should be in the cache and use short instructions for looptops: for instructions longer than a dword, add at least 3 (bus) cycles, longer than 16 bytes, add at least 5 (bus) cycles. > TEST vs CMP: - When comparing a value in a register with 0, use the TEST command. TEST Use test when comparing the result of a boolean AND command with an immediate constant for equality or inequality if the register is (E)AX. You can also use it for zero testing, "test EBX,EBX" sets the zero flag if EBX is zero. - Most tests against 0 are not necessary, because ZF is already set (most instructions do thism except mov, lea) - "test Reg, Mem" is faster than "cmp Reg, Mem". - Multiple tests may be avoided by using jmp-tables, so consider: Example: test AX, 2 jnz secondtest test AX, 1 jnz label1 jmp label0 secondtest: test AX, 1 jnz label3 jmp label2 should be mov BX, AX and BX ,3 add BX, BX jmp [table+BX] > DON'T EXHAUST THE PREFETCH QUEUE: The prefetch queue is 32 byte width and loaded from cache. But memory operations have priority for cache so the prefetch queue will not be filled if there are consecutive memory references. The empty queue causes 3 cycles stall. So rearrage instructions, so there is a non memory referencing instructions at least two cycles before the queue gets exhausted. (4 consecutive 32 bit writes will probably stall because of an empty queue) > DATA AND CODE ALIGNMENT: - Misaligned memory access costs 3 cycles. - Data for word-access should not cross 4 byte boundaries, ie. address is x00b, x01b, x10b, but not x11b. Dword-access should be aligned by 4. - To speed up the instructions, alignment of code should be on the beginning of cache boundaries (16 bytes), so align labels if they are less than 8 bytes away from a cache line boundary. Code alignment matters only at branch targets, because only the first instruction fetch after a branch can suffer from nonalignment. > USE CACHE: - Cache should fill complete before subsequent accesses to the same line: If a read misses the cache or a read/write to the line being filled occurs during fill the intruction must wait until fill is complete. - Use "short" loops so there is more probability that the code to execute will reside fully into the cache. Remember you can count on an external cache of at least 64k (usually 128k..256k). Also keep instructions short. So, if you have to process big arrays or big bitmaps with multiple passages do not "scan" all the array for every pass, use a "strip by strip" strategy instead so every "strip" fully fits into the cache and is ready for the second and third pass without cache misses. This technique is called STRIP-MINING, you can include into your program a routine that checks the optimal "strip size" trying a multi-pass test on 64k,128k,256k,512k strips and "complete array" and then sets the optimal "strip size" to use when perfoming "strip-mining" routines. > INTEGER MULTIPLY (SHL and ADD or LEA for 32 bit): - The integer multiply by an immediate can usually be replaced with a faster and simpler series of shifts, subs, adds and lea's (32 bit). As a rule when 6 or fewer bits are set in the binary representation of the constant, it is better to look at other ways of multiplying and not use INTEGER MULTIPLY. - LEA's generally increase the chance of AGI's. However, LEA's can be advantageous because: * In many cases an LEA instruction may be used to replace constant multiply instructions. (a sequence of LEA, add and shift for example) * LEA may be used as a three/four operand addition instruction. * LEA can be advantageous to avoid copying a register when both operands to an ADD are being used after the ADD as LEA needs not to overwrite its operands. * LEA may be used for shifting (faster, to mul 2,4,8, etc than to shl) The general rule is that the "generic" LEA A,[B+C*INDEX+DISPLACEMENT] where A can be a register or a memory location and B,C are registers and INDEX=1,2,4,8 and DISPLACEMENT = 0 ... 4*1024*1024*1024 or (if performing signed int operations) -2*1024*1024*1024 ... + (2*1024*1024*1024 -1 ) replaces the "generic" worst-case sequence MOV X,C ; X is a "dummy" register MOV A,B MUL X,INDEX ;actually SHL X, (log2(INDEX)) ADD A,DISPLACEMENT ADD A,X So using LEA you can actually "pack" up to FIVE instructions into one. Even counting a "worst case" of TWO OR THREE AGIs caused by the LEA this is very fast compared to "normal" code. What's more, cpu registers are precious, and using LEA you don't need a dummy "X" register to preserve the value of B and C. Here the LEA instruction comes in as major cpu booster, for example: LEA ECX,[EDX*2] ; multiply EDX by 2 and store result into ECX LEA ECX,[EDX+EDX*2] ; multiply EDX by 3 and store result into ECX LEA ECX,[EDX*4] ; multiply EDX by 4 and store result into ECX LEA ECX,[EDX+EDX*4] ; multiply EDX by 5 and store result into ECX LEA ECX,[EDX*8] ; multiply EDX by 8 and store result into ECX LEA ECX,[EDX+EDX*8] ; multiply EDX by 9 and store result into ECX And you can combine leas too!!!! lea ECX,[EDX+EDX*2] // AGI lea ECX,[ECX+ECX*8] ; ECX <-- EDX*27 > ZERO-EXTENSION - With 8-bit operands, use the byte opcodes, rather than using the 16/32 bit instructions with sign and zero extended modes. Internal sign extension is expensive, and speed increases can be found by simplifying the process as much as possible. - The MOVZX takes four cycles to execute due to due zero-extension wobblies. A better way to load a byte into a register is by: xor (E)AX,(E)AX mov AL,memory As the xor just clears the top parts of EAX, the xor may be placed on the OUTSIDE of a loop that uses just byte values. It is recommended that 8/16 bit data be accessed with the MOVSX and MOVZX if you cannot place the XOR on the outside of the loop, do the "replacement" only for movsx/zx inside loops. - In most cases, an Integer Divide is preceded by a "CDQ" instruction. This is as divide instructions use EDX:EAX as the dividend and CDQ sets up EDX. It is better to copy EAX into EDX, then arithmetic-right-shift EDX 31 places to sign extend. If you know the value is a positive, use "xor EDX,EDX". > IN, OUT AND TASK SWITCHING - "Out" should be avoided whenever possible on the 486, and likewise for "In". "Out" takes anywhere from 10 to 31 cycles, depending on processor mode and privileges, more than an order of magnitude slower than MOV. - Task Switching is evil. It is slow, real slow. Avoid task switching too often, as more and more of your time will be spent in processing the task switch. - For faster task switching, perform your task switching in software. This allows a smaller processor state to be saved and restored. Anything that shaves time off 75+ clock cycles isn't all that bad. > REFERENCES: The information is taken from + 80X86.TXT (Dr. Dobb's Journal, March 1991 v16 n3 p16(8) by Michael Abrash, found at ftp.cdrom.com /pub/demos/tutor/graphpro.lzh) + INTEL.DOC (List of op codes plus timing info up to 486, supplied at x2ftp.oulu.fi in /pub/msdos/programming/gpe/pcgpe.zip by Mark Feldman) + MILITARY Intel486 Processor Manual (found at www.wais.com /pub/techdoc/) + OPT32.DOC (Intel Application Note AP 500, found at http://ftp.x86.org) + OPTIMIZE.TXT (Posting on the Internet made by Michael Kunstelj, found at x2ftp.oulu.fi /pub/msdos/programming/docs/optimize.zip) + My own experiences end tests, and some discussion at comp.lang.asm. > CONTACT ME! Feel free to contact me with improvements, suggestions, flames, cheques, etc. Any comments welcomed. If you have some information which should be added, send them to me... ------------------------------------------------------------------------------ KOFLER Peter email: e9126050@fbma.tuwien.ac.at ------------------------------------------------------------------------------ - End of OPTIM486.TXT -