Journal:   Dr. Dobb's Journal  March 1991 v16 n3 p16(8)
-----------------------------------------------------------------------------
Title:     80x86 optimization: aim down the middle and pray. (80x86 family of
           microprocessors) (tutorial)
Author:    Abrash, Michael.
AttFile:    Program:  80X86.ASC  Source code listing.

Summary:   Optimizing code for 8088, 80286, 80386 and 80486 microprocessors
           is difficult because the chips use significantly different memory
           architectures and instruction execution times.  Code cannot be
           optimized for the 80x86 family; rather, code must be designed to
           produce good performance on a range of systems or optimized for
           particular combinations of processors and memory.  Programmers
           must avoid the unusual instructions supported by the 8088, which
           have lost their performance edge in subsequent chips.  String
           instructions should be used but not relied upon.  Registers should
           be used rather than memory operations.  Branching is also slow for
           all four processors.  Memory accesses should be aligned to improve
           performance.  Generally, optimizing an 80486 requires exactly the
           opposite steps as optimizing an 8088.
-----------------------------------------------------------------------------
Descriptors..
Company:   Intel Corp. (Products).
Ticker:    INTC.
Product:   Intel 80286 (Microprocessor) (Programming)
           Intel 80386 (Microprocessor) (Programming)
           Intel 80486 (Microprocessor) (Programming)
           Intel 8088 (Microprocessor) (Programming).
Topic:     Microprocessors
           Optimization
           Programming
           Tutorial
           Assembly Language
           Guidelines
           Type-In Programs
           Microcode
           Processor Architecture.
Feature:   illustration
           graph.
Caption:   Official and actual cycles per binary-to-hex ASCII conversion.
           (graph)
           Actual performance in microseconds of two solutions to a problem.
           (graph)
           Actual performance of three clearing approaches across the 80x86
           family. (graph)

-----------------------------------------------------------------------------
Full Text:

Optimization

Picture this: You're an archer aiming at a target 100 feet away.  A strong
wind comes up and pushes each arrow to the left as it flies.  Naturally, you
compensate by aiming farther to the right.  That's what it's like optimizing
for the 8088; once you learn to compensate for the strong but steady effects
of the prefetch queue and the 8-bit bus, you can continue merrily on your
programming way.

Now the wind starts gusting unpredictably.  There's no way to compensate, so
you just aim for the bull's-eye and hope for the best.  That's what it's like
writing code for good performance across the entire 80x86 family, or even for
the 286/386SX/386 heart of today's market.  You just aim down the middle and
pray.

The New World of the 80x86

In the beginning, the 8088 was king, and that was good.  The optimization
rules weren't obvious, but once you learned them, you could count on them
serving you well on every computer out there.

Not so these days.  There are four major processor types--8088, 80286, 80386,
and 80486--with a bewildering array of memory architectures: cached (in
several forms), page mode, static-column RAM, interleaved, and, of course,
the 386SX, with its half-pint memory interface.  The processors offer wildly
differing instruction execution times, and memory architectures warp those
times further by affecting the speed of instruction fetching and access to
memory operands.  Because actual performance is a complex interaction of
instruction characteristics, instruction execution times, and memory access
speed, the myriad processor-memory combinations out there make "exact
performance" a meaningless term.  A specific instruction sequence may run at
a certain speed on a certain processor in a certain system, but that often
says little about the performance of the same instructions on a different
processor, or even on the same processor with a different memory system.  The
result: Precise optimization for the general PC market is a thing of the
past.  (We're talking about optimizing for speed here; optimizing for size is
the same for all processors so long as you stick to 8088-compatible code.)

So there is no way to optimize performance ideally across the 80x86 family.
An optimization that suits one processor beautifully is often a dog on
another.  Any 8088 programmer would instinctively replace:

DEC  CX JNZ  LOOPTOP

with:

LOOP  LOOPTOP

because LOOP is significantly faster on the 8088.  LOOP is also faster on the
286.  On the 386, however, LOOP is actually two cycles slower than DEC/JNZ.
The pendulum swings still further on the 486, where LOOP is about twice as
slow as DEC/JNZ--and, mind you, we're talking about what was originally
perhaps the most obvious optimization in the entire 80x86 instruction set.

In short, there is no such thing as code that's truly optimized for the
80x86.  Instead, code is either optimized for specific processor-memory
combinations, or aimed down the middle, designed to produce good performance
across a range of systems.  Optimizing for the 80x86 family by aiming down
the middle is quite different from optimizing for the 8088, but many PC
programmers are inappropriately still applying the optimization lore they've
learned over the years on the PC (or AT).  The world has changed, and many of
those old assumptions and tricks don't hold true anymore.

You will not love the new world of 80x86 optimization, which is less precise
and offers fewer clever tricks than optimizing for the 8088 alone.  Still,
isn't it better to understand the forces affecting your code's performance
out in the real world than to optimize for a single processor and hope for
the best?

Better, yes.  As much fun, no.  Optimizing for the 8088 was just about as
good as it gets.  So it goes.

Optimization Rules for a New World

So, how do you go about writing fast code nowadays?  One way is to write
different versions of critical code for various processors and memory access
speeds, selecting the best version at runtime.  That's a great solution, but
it requires an awful lot of knowledge and work.

An alternative is to optimize for one particular processor and settle for
whatever performance you get on the others.  This might make sense when the
8088 is the target processor because it certainly needs the optimization more
than any other processor.  However, 8088 optimization works poorly at the
upper end of the 80x86 family.

Nowadays, though, most of us want to optimize for the 286 and 386 systems
that dominate the market, or across all 80x86 processors, and that's a tough
nut to crack.  The 286 and 386 come in many configurations, and you can be
sure, for example, that a 386SX, an interleaved 386, and a cached 386 have
markedly different performance characteristics.  There are, alas, no hard and
fast optimization rules that apply across all these environments.

My own approach to 80x86 optimization has been to develop a set of general
rules that serve reasonably well throughout the 80x86 line, especially the
286 and 386, and to select a specific processor (in my case a cached 386, for
which cycle times tend to be accurate) to serve as the tiebreaker when
optimization details vary from one processor to another.  (Naturally, it's
only worth bothering with these optimizations in critical code.)  The rules
I've developed are:

* Avoid accessing memory operands; use the registers to the max.

* Don't branch.

* Use string instructions, but don't go much out of your way to do so.

* Keep memory accesses to a minimum by avoiding memory operands and keeping
instructions short.

* Align memory accesses.

* Forget about many of those clever 8088 optimizations, using oddball
instructions such as DAA and XLAT, that you spent years learning.

Next I'll discuss each of these rules in turn in the context of
8088-compatible real mode, which is still the focus of the 80x86 world.
Later, I'll touch on protected mode.

Let's start by looking at the last--and most surprising--rule.

Kiss Those Tricks Goodbye

To skilled assembly language programmers, the 8088 is perhaps the most
wonderful processor ever created, largely because the instruction set is
packed with odd instructions that are worthless to compilers but can work
miracles in the hands of clever assembly programmers.  Unfortunately, each
new generation of the 80x86 has rendered those odd instructions and marvelous
tricks less desirable.  As the execution time for the commonly used
instruction ADD BX, 4 has gone down from four cycles (8088) to three cycles
(286) to two cycles (386) to one cycle (486), the time for the less
frequently used instruction CBW has gone from two cycles (8088 and 286) up to
three cycles (386 and 486)!

Consider this ancient optimization for converting a binary digit to hex
ASCII:

ADD  AL,90H DAA ADC  AL,40H DAA

Now consider the standard alternative:

ADD  AL,'0' CMP  AL,'9' JBE  HaveAscii ADD  AL,'A'-('9'+1) HaveAscii:

As Figure 1 indicates, the standard code should be slower on an 8088 or 286,
but faster on a 386 or a 486--and real-world tests confirm those results, as
shown in Figure 2.  (All "actual performance" timings in this article were
performed with the Zen timer from Zen of Assembly Language, see "References"
for details.  The systems used for the tests were: 8088, standard 4.77 MHz PC
XT; 80286, standard one-wait-state, 8 MHz PC AT; 386SX, 16 MHz noncached;
80386, 20 MHz externally cached with all instructions and data in external
cache for all tests except Listings One and Two; 80486, 25 MHz internally
cached, with all instructions and data in internal cache for all tests except
Listings One and Two.)

In other words, this nifty, time-tested optimization is an anti-optimization
on the 386 and 486.

Why is this?  On the 386, DAA--a rarely used instruction--takes four cycles,
and on the 486 it takes two cycles, in both cases twice as long as the more
common instructions CMP and ADD; in contrast, on the 8088 all three
instructions are equally fast at four cycles.  Also, the instruction-fetching
advantage that the 1-byte DAA provides on the 8088 means nothing on a cached
386.

Nor is this an isolated example.  Most oddball instructions, from AAA to
XCHG, have failed to keep pace with the core instructions--ADC, ADD, AND,
CALL, CMP, DEC, INC, Jcc, JMP, LEA, MOV, OR, POP, PUSH, RET, SBB, SUB, TEST,
and XOR--during the evolution from 8088 to 486.  As we saw earlier, even LOOP
lags behind on the 386 and 486.  Check your favorite tricks for yourself;
they might or might not hold up on the 386, but will most likely be
liabilities on the 486.  Sorry, but I just report the news, and the news is:
Kiss most of those tricks goodbye as the 386 and 486 come to dominate the
market.  (This means that hand-optimization in assembly language yields less
of a performance boost nowadays than it did when the 8088 was king; the
improvement is certainly significant, but rarely in the 200-500 percent range
anymore.  Sic transit gloria mundi.)  Most startling of all, string
instructions lose much of their allure as we move away from the 8088, hitting
bottom on the 486.

The 486: All the Rules Change

The 486 represents a fundamental break with 8088-style optimization.
Virtually all the old rules fail on the 486, where, incredibly, a move to or
from memory often takes just one cycle, but exchanging two registers takes
three cycles.  The nonbranching core instructions mentioned earlier take only
one cycle on the 486 when operating on registers; MOV can, under most
conditions, access memory in one cycle; and CALL and JMP take only three
cycles, given a cache hit.  However, noncore instructions take considerably
longer.  XLAT takes four cycles; even STC and CLC take two cycles each.  The
486's highly asymmetric execution times heavily favor core instructions and
defeat most pre-486 optimizations.

Core instructions do have a weakness on the 486.  While 486 MOVs involving
memory are remarkably fast, accessing memory for an operand to OR, ADD, or
the like costs cycles.  Even with the 8K internal cache, memory is not as
fast as registers, except when MOV is used (and sometimes not even then), so
registers are still preferred operands.  (AND [BX],1 is fast, at only three
cycles, but AND BX,1 takes only one cycle--three times as fast.)

OUT should be avoided whenever possible on the 486, and likewise for IN.  OUT
takes anywhere from 10 to 31 cycles, depending on processor mode and
privileges, more than an order of magnitude slower than MOV.  The lousy
performance of OUT -- true on the 386 as well -- has important implications
for graphics applications.

String instructions are so slow on the 486 that you should check cycle times
before using any string instruction other than the always superior REP MOV's.
For example, LODSB takes five cycles on the 486, but MOV AL,[SI]/INC SI takes
only two cycles; likewise for STOSB and MOV [DI],AL/INC DI.  Listing One
(page 73) uses LODSB/STOSB to copy a string, converting lowercase to
uppercase while copying; Listing Two (page 73) uses MOV/INC instead.  Figure
3 summarizes the performance of the two routines on a variety of processors;
note the diminishing effectiveness of string instructions on the newer
processors.  Think long and hard before using string instructions other than
REP MOVS on the 486.

Optimization for the 486 is really a whole new ball game.  When optimizing
across the 80x86 family, the 486 will generally be the least of your worries
because it is so much faster than the rest of the family; anything that runs
adequately on any other processor will look terrific on the 486.  Still, the
future surely holds millions of 486s, so it wouldn't hurt to keep one eye on
the 486 as you optimize.

String Instructions: Fading Stars

On the 8088, string instructions are so far superior to other instructions
that it's worth going to great lengths to use them, but they lose much of
that status on newer processors.  One of the best things about string
instructions on the 8088 is that they require little instruction fetching,
because they're 1-byte instructions and because of the REP prefix; however,
instruction fetching is less of a bottleneck on newer processors.  String
instructions also have superior cycle times on the 8088, but that advantage
fades on the 286 and 386 as well.

On the 286, string instructions (when they do exactly what you need) are
still clearly better than the alternatives.  On the 386, however, some string
instructions are, even under ideal circumstances, the best choice only by a
whisker, if at all.  For example, since Day One, clearing a buffer has been
done with REP STOS.  That's certainly faster than the looping MOV/ADD
approach shown in Listing Three (page 73), but on the 386 and 486 it's no
faster than the unrolled loop MOV/ADD approach of Listing Four (page 73), as
shown in Figure 4.  (Actually, in my tests REP STOS was a fraction of a cycle
slower on the 386, and fractionally faster on the 486.)  REP STOS is much
easier to code and more compact, so it's still the approach of choice for
buffer clearing--but it's not necessarily fastest on a 486 or fast-memory
386.  This again demonstrates just how unreliable the old optimization rules
are on the newer processors.

The point is not that you shouldn't use string instructions on the 386.  REP
MOVs is the best way to move data, and the other string instructions are
compact and usually faster, especially on uncached systems.  However, on the
386 it's no longer worth going to the trouble of juggling registers and
reorganizing data structures to use string instructions.  Furthermore, when
you truly need maximum performance on the 386, check out nonstring
instructions in unrolled loops.  It goes against every lesson learned in a
decade of 8088 programming, but avoiding string instructions sometimes pays
on the 386.

The Siren Song of Memory Accesses

Finally, here's a rule that's constant from the 8088 to the 486: Use the
registers.  Avoid memory.

Don't be fooled by the much faster memory access times of the 286 and 386.
The effective address calculation time of the 8088 is mostly gone, so MOV
AX,[BX] takes only five cycles on the 286, and ADD [SI],DX takes only seven
on the 386.  That's so much faster than the 17 and 29 cycles, respectively,
that they take on the 8088 that you might start thinking that memory is
pretty much interchangeable with registers.

Think again.  MOV AX,BX is still more than twice as fast as MOV AX,[BX] on
the 286, and ADD SI,DX is more than three times as fast as ADD [SI],DX on the
386.  Memory operands can also reduce performance by slowing instruction
fetching.  Memory is fast on the 286 and 386.  Registers are faster.  Use
them as heavily as possible.

Don't Branch

Here's another rule that stays the same across the 80x86 family: Don't
branch.  Branching suffers on the 8088 from lengthy cycle counts and emptying
the prefetch queue.  Emptying the prefetch queue is a lesser but nonetheless
real problem in the post-8088 world, and the cycle counts of branches are
still killers.  As Figure 4 indicates, it pays to eliminate branches by
unrolling loops or using repeated string instructions.

Modern-Day Instruction Fetching

Instruction fetching is the bugbear of 8088 performance; the 8088 simply
can't fetch instruction bytes as quickly as it can execute them, thanks to
its undersized bus.  Minimizing all memory accesses, including instruction
fetches, is paramount on the 8088.

Instruction fetching is less of a problem nowadays.  Figure 5 shows the
maximum rates at which various processors can fetch instruction bytes;
clearly, matters have improved considerably since the 8088, although
instructions also execute in fewer cycles on the newer processors.  Fetching
problems can occur on any 80x86 processor, even the 486, but the only
processors other than the 8088 that face major instruction fetching problems
are the one-wait-state 286 and the 386SX, although uncached 386s may also
outrun memory.  However, the problems here are different from and less
serious than with the 8088.

Consider: An 8088 executes a register ADD in three cycles, but requires eight
cycles to fetch that instruction, a fetch/execute ratio of 2.67.  A
one-wait-state 286 requires three cycles to fetch a register ADD and executes
it in two cycles, a ratio of 1.5.  A 386SX can fetch a register ADD in two
cycles, matching the execution time nicely, and a cached 386 can fetch two
register ADDs in the two cycles it takes to execute just one.  For
register-only code--the sort of code critical loops should contain--the 386
generally runs flat out, and the 286 and 386SX usually (not always, but
usually) outrun memory by only a little at worst.  Greater fetching problems
can arise when working with large instructions or instruction sequences that
access memory nonstop, but those are uncommon in critical code.  This is a
welcome change from the 8088, where small, register-only instructions tend to
suffer most from inadequate instruction fetching.

Also, uncached 386 systems often use memory architectures that provide
zero-wait-state performance when memory is accessed sequentially.  In
register-only code, instruction fetches are the only memory accesses, so
fetching proceeds at full speed when the registers are used heavily.

So, is instruction fetching a problem in the post-8088 world?  Should
instructions be kept short?

Yes.  Smaller instructions can help considerably on the one-wait-state 286
and on the 386SX.  Not as much as on the 8088, but it's still worth the
trouble.  Even a cached 386 can suffer from fetching problems, although
that's fairly uncommon.  For example, when several MOV WORD PTR [MemVar],0
instructions are executed in a row, as might happen when initializing memory
variables, performance tends to fall far below rated speed, as shown in
Figure 6.  The particular problem with MOV WORD PTR [MemVar],0 is that it
executes in just two (386) or three (286) cycles, yet has both an addressing
displacement field and a constant field.  This eats up memory bandwidth by
requiring more instruction fetching.  It also accesses memory, eating up
still more bandwidth.  We'll see this again, and worse, when we discuss
protected mode.

Generally, though, post-8088 processors with fast memory systems and
full-width buses run most instructions at pretty near their official cycle
times; for these systems, optimization consists mostly of counting cycles.
Slower memory or constricted buses (as in the 386SX) require that memory
accesses (both instruction fetches and operand accesses) be minimized as
well.  Fortunately, the same sort of code--register only--meets both
requirements.

Use the registers.  Avoid constants.  Avoid displacements.  Don't branch.
That's the big picture.  Don't sweat the details.

Alignment: The Easy Optimization

The 286, 386SX, and 386 take twice as long to access memory words at odd
addresses as at even addresses.  The 386 takes twice as long to access memory
dwords at addresses that aren't multiples of four as those that are.  You
should use ALIGN 2 to word align all word-sized data, and ALIGN 4 to dword
align all data that's accessed as a dword operand, as in:

ALIGN  4 MemVar  dd  ? : MOV EAX,[MemVar]

Alignment also applies to code; you may want to word or dword align the
starts of procedures, labels that can only be reached by branching, and the
tops of loops.  (Code alignment matters only at branch targets, because only
the first instruction fetch after a branch can suffer from nonalignment.)
Dword alignment of code is optimal, and will help on the 386 even in real
mode, but word alignment will produce nearly as much improvement as dword
alignment without wasting nearly as many bytes.

Alignment improves performance on many 80x86 systems without hindering it on
any.  Recommended.

Protected Mode

There are two sorts of protected mode, 16-bit and 32-bit.  The primary
optimization characteristic of 16-bit protected mode (OS/2 1.X, Rational DOS
Extender) is that it takes an ungodly long time to load a segment register
(for example, MOV ES,AX takes 17 cycles on a 286) so load segment registers
as infrequently as possible in 16-bit protected mode.

Optimizing for 32-bit protected mode (OS/2 2.0, SCO Unix, Phar Lap DOS
Extender) is another matter entirely.  Typically, no segment loads are needed
because of the flat address space.  However, 32-bit protected mode code can
be bulky, and that can slow instruction fetching.  Constants and addressing
displacements can be as large as 4 bytes each, and an extra byte, the SIB
byte, is required whenever two 32-bit registers are used to address an
operand or scaled addressing is used.  So, for example, MOV DWORD PTR
[MemVar],0 is a 10-byte instruction in 32-bit protected mode.  The
instruction is supposed to execute in two cycles, but even a 386 needs four
to six cycles to fetch it, plus another two cycles to access memory; a few
such instructions in a row can empty the prefetch queue and slow performance
considerably.  The slowdown occurs more quickly and is more acute on a 386SX,
which needs 14 cycles to perform the memory accesses for this nominally
2-cycle instruction.

Code can get even larger when 32-bit instructions are executed in 16-bit
segments, adding prefix bytes.  (Avoid prefix bytes if you can; they increase
instruction size and can cost cycles.)  Figure 7 shows actual versus nominal
cycle times of multiple MOV DWORD PTR [EBX*4+MemVar],0 instructions running
in a 16-bit segment.  Although cache type (write-back, write-through) and
main-memory write time also affect the performance of stores to memory, there
is clearly a significant penalty for using several large (in this case,
13-byte) instructions in a row.

Fortunately, this is a worst case, easily avoided by keeping constants and
displacements out of critical loops.  For example, you should replace:

ADDLOOP: MOV  DWORD PTR BaseTable[EDX+EBX],0 ADD  EBX,4 DEC  ECX JNZ  ADDLOOP

with:

LEA  EBX,BaseTable[EDX+EBX] SUB  EAX,EAX ADDLOOP: MOV  [EBX],EAX ADD  EBX,4
DEC  ECX JNZ  ADDLOOP

Better yet, use REP STOSD or unroll the loop!

Happily, register-only instructions are no larger in 32-bit protected mode
than otherwise and run at or near their rated speed in 32-bit protected mode
on all processors.  All in all, in protected mode it's more important than
ever to avoid large constants and displacements and to use the registers as
much as possible.

Conclusion

Optimization across the 80x86 family isn't as precise as 8088 optimization,
and it's a lot less fun, with fewer nifty tricks and less spectacular
speed-ups.  Still, familiarity with the basix 80x86 optimization rules can
give you a decided advantage over programmers still laboring under the
delusion that the 286, 386, and 486 are merely faster 8088s.

References

Abrash, Michael.  Zen of Assembly Language.  Glenview, Ill.: Scott, Foresman,
1990.

Barrenechea, Mark.  "Peak Performance: On to the 486."  Programmer's Journal,
(November-December 1990).

Paterson, Tim.  "Assembly Language Tricks of the Trade."  Dr. Dobb's Journal
(March 1990).

Turbo Assembler Quick Reference Guide.  Borland International, 1990.

i486 Microprocessor Programmer's Reference Manual.  Intel Corporation, 1989.

80386 Programmer's Reference Manual.  Intel Corporation, 1986.

Microsystems Components Handbook: Microprocessors Volume I.  Intel
Corporation, 1985.