OK, here is some information I put together over the last couple of
days. I would like some kind of response, positive or negative, to
what I've written. No point in uploading it to x2ftp or cdrom if you
all think it's crap.

------------- begin ftmap.txt

                         Fast texture mapping
                         --------------------

               by Mats Byggmastar  a.k.a.  MRI / Doomsday
                          mri@penti.sit.fi

                      19 Jun. 1996  Espoo, Finland

            Read this today, it might be obsolete tomorrow.


About this document
-------------------

This document tries to describe how to do fast texture mapping. It is
not a complete tutorial, as it only covers the time critical parts,
which happen to be the most interesting parts. The information here is
not meant as a beginner's guide to texture mapping. It is aimed at
people who may already have a working texture mapper but are looking
for more speed.

The information is based on my own work and findings, but also on
information found on the net and on ideas given to me by other coders,
especially the individuals that hang out on the IRC channel #coders. I
once spoke to a 'cool' dude from .au who said: "Man, did you think we
come here to talk about coding?". Well, what can I say... Go screw
yourself mister! Many of the coders there are willing to share ideas
and answer decent questions. Just look for .fi in their email
address... only joking, huhu.

I am not claiming that the method described here is THE fastest method
of doing texture mapping, but it is pretty fast. The resulting image
will look a bit more ragged than with ordinary u,v interpolation in
the inner loop, but one can't have both quality and speed in this
case, I think.

Ah, yes. This document describes linear texture mapping, not
perspective correct mapping. So if you are looking for information on
how to do scanline subdivision, look somewhere else. You won't find it
here. "Linear mapping should be enough for anyone..."
Disclaimer
----------

If water didn't exist, no one could learn how to swim, and then
everybody would drown. I take no responsibility for all the people
drowning in Finland this summer.

Some parts of the technical discussion at the end of the document
might not be 100% accurate, but from a programmer's point of view the
guidelines given should apply anyway.


Assume the following
--------------------

- We are only doing triangles.

- You agree that a fractional part of 8 bits is enough for
  interpolating u and v on the scanlines of the triangle.

- Bitmaps always have a width of 256 pixels and a maximum height of
  256 pixels.

- The CPU is a Pentium and we are in 32 bit protected mode and flat
  memory model (TASM+WATCOM+PMODE/W).

- You have made your own Phong shader or texture mapper that is not
  quite as fast as your friend's texture mapper, which he brags about
  being 1.5 ticks per pixel, doing 70k polys per sec. (What a stupid
  way of expressing the speed of a texture mapper, really. Who wants
  to see a vector sequence at 1 frame per second anyway, hehe.)


The slow method
---------------

The slow method of doing texture mapping is to interpolate u and v
(and triangle x) on both the left and right side of the triangle and
then calculate the delta u and v for each scanline. Then interpolate
u and v when drawing the scanline.

To make this even slower you could first interpolate u and v (and
triangle x) for both sides of the triangle and store the values in an
edge buffer, then pick the values from the edge buffer and draw each
scanline.


The fast method
---------------

Don't use an edge buffer as described in the slow method above.
Calculate the edge deltas when you need them and interpolate along the
edges at the same time as you draw each scanline.

One important thing you should realize is that when texture mapping a
triangle, you are interpolating two variables, u and v, whose deltas
are constant over the whole triangle. I repeat, the deltas are
constant for the whole triangle.
Make sure you understand this, because it is the key to fast texture
mapping (or any other type of linear shading, for that matter).
Because the deltas (delta u and delta v) are constant, we only need to
calculate them once for the whole triangle. Also, when interpolating u
and v along the edges of the triangle, you only need to interpolate u
and v on one side of the triangle. (Triangle x must be interpolated on
both sides.)

The deltas are constant, so the offsets into the bitmap for each pixel
in a scanline will also be constant. I.e. we can precalculate a whole
run and use that in the inner loop.

The inner loops for this type of texture mapper can look very
different. The most radical must be to unroll it all the way and plug
the offsets right into the mov instructions, i.e. selfmodifying code.
But do be careful with selfmodifying code. It can slow down your code
instead of speeding it up. Each time you modify the code, it will be
reloaded into the code cache, which seems to take a long time. Also,
unrolling code is a very bad idea on the Pentium chip.

Here is some crappy code that shows the principle of this type of
inner loop:

        jmp ecx                 ; Jump to the right place in the 'loop'

        mov al, [esi+12345]
        mov [edi+319], al
        mov al, [esi+12345]     ; Get pixel
        mov [edi+318], al       ; Plot pixel

        ......

        mov al, [esi+12345]     ; '12345' is the selfmodifying part
        mov [edi+2], al         ; that will be modified once per triangle
        mov al, [esi+12345]
        mov [edi+1], al
        mov al, [esi+12345]
        mov [edi+0], al

Note that we are doing it backwards, from right to left. This makes it
easier to set up esi and edi. You might want to know that each
instruction in this loop is 7 bytes, so you will be doing X*14 when
calculating the jump offset. X*14 is X*16 - X*2. You should of course
not plug in the offsets for the whole loop if you only have a small
triangle. The length of the longest scanline is a byproduct of the
constant delta calculations.

So what about the 1.5 ticks per pixel loop?
Well, the following piece of code is usually what people think of. I'm
not really sure that this is actually 1.5 ticks per pixel, as the
'mov [edi+?],ax' has a prefix byte. This code will need some work to
make the instructions pair on the Pentium (read about the code cache
at the end).

        jmp ecx

        ......

        mov al, [esi+12345]
        mov ah, [esi+12345]
        mov [edi+4], ax
        mov al, [esi+12345]
        mov ah, [esi+12345]
        mov [edi+2], ax
        mov al, [esi+12345]
        mov ah, [esi+12345]
        mov [edi], ax

Now to a cooler method that is not selfmodifying and doesn't need to
be unrolled. The idea is very similar to the above, but in this loop
we have the offsets stored in a lookup table instead. A loop that
plots one pixel per turn could then look like this:

        destptr = screen + edge_x + y * screenwidth;
        srcptr = bitmap + edge_u + edge_v * 256;

        for( scanlinewidth )
        {
            *(destptr++) = srcptr[ *(lookup++) ];
        }

        ; esi = ptr to starting position in bitmap
        ; ebx = 0
        ; edx = ptr to lookup table
        ; edi = ptr to starting point in screen buffer
        ; ebp = width of scanline

@@inner:
        mov al, [esi+ebx]       ; Get pixel
        mov ebx, [edx]          ; Get offset for next pixel
        mov [edi], al           ; Plot pixel
        add edx, 4
        inc edi
        dec ebp
        jnz @@inner

The above loop should be something like 4 ticks per pixel on a Pentium
if we don't have any cache misses and such. Not very impressive. This
loop can on the other hand be changed to plot 4 pixels at a time:

@@inner:
        mov al, [esi+ebx]       ; Get pixel #1
        mov ebx, [edx]
        mov ah, [esi+ecx]       ; Get pixel #2
        mov ecx, [edx+4]
        shl eax, 16             ; Move pixels 1 and 2 to the high word
        add edi, 4
        mov al, [esi+ebx]       ; Get pixel #3
        mov ebx, [edx+8]
        mov ah, [esi+ecx]       ; Get pixel #4
        mov ecx, [edx+12]
        rol eax, 16             ; Swap the high and low words
        add edx, 16
        mov [edi], eax          ; Plot all 4 pixels
        dec ebp
        jnz @@inner

Now this one is 9 ticks per 4 pixels, with the pixels written as a
dword. Very good if we align the write on a dword boundary. Use the
other loop for very short lines, or to get this one aligned on dword
and then use this one for the rest of the scanline.
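For readers who think better in C, here is a rough C version of the
one-pixel-per-turn lookup loop above. The function name and signature
are my own inventions; it assumes the offset table has already been
built, with one entry per pixel of the longest scanline, as described
just below.

```c
#include <stdint.h>

/* Sketch of the lookup-table inner loop: the per-pixel bitmap offsets
   were precalculated once per triangle, so the scanline loop is just
   a gather through the table. */
static void draw_scanline_lookup(uint8_t *destptr, const uint8_t *srcptr,
                                 const int32_t *lookup, int width)
{
    /* The first pixel comes straight from srcptr (offset 0 relative
       to the starting u,v), just like ebx = 0 in the assembly
       version; lookup[0] already holds the offset of the second
       pixel. */
    int32_t offset = 0;
    for (int x = 0; x < width; x++) {
        destptr[x] = srcptr[offset];   /* get pixel, plot pixel */
        offset = lookup[x];            /* offset for the next pixel */
    }
}
```

Note how the offset for the next pixel is fetched while the current
one is being plotted, mirroring the way the assembly loop overlaps the
two reads.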
Calculate the lookup table with the following loop (this loop can also
be used to calculate the offsets in the selfmodifying example):

        u = delta_u;
        v = delta_v;

        for( tablewidth )
        {
            *(lookup++) = u + v * 256;
            u = u + delta_u;
            v = v + delta_v;
        }

        ; ebx = esi = delta u (24:8 bit fixed point)
        ; ecx = edi = delta v (24:8 bit fixed point)
        ; edx = ptr to lookup table
        ; ebp = length of table (the width of the longest scanline)

@@mklookup:
        mov eax, ecx
        add ecx, edi            ; Interpolate v
        mov al, bh
        add ebx, esi            ; Interpolate u
        mov [edx], eax          ; Store u + 256*v
        add edx, 4
        dec ebp
        jnz @@mklookup

Note that the first offset stored in the lookup table is not 0.


Equations for the constant deltas
---------------------------------

Sort the vertices of the triangle so that the topmost vertex is known
as x1,y1 and the bottom vertex as x3,y3, like in the drawing below.

                  x1,y1
                    p1
                    /\
                   /  \
                  /    \
                 /      \
                /        \
               /          \
              /            \ x2,y2
             /______________\  p2
             \     width    /
              \            /
               \          /
                \        /
                 \      /
                  \    /
                   \  /
                    \/
                  x3,y3
                    p3

        xn,yn - x,y screen coordinates at vertex n (integers)
        pn    - Value at vertex n of the variable to calculate the
                constant delta for. Note that this variable is assumed
                to have a 6 bit fractional part (26:6 bit fixed point).
        width - Width of the longest scanline in the triangle

Sorting 3 vertices is no more than 3 compares. Another thing: don't
load all the x,y,u and v values of the vertices into registers. Use
pointers to the vertex structures instead. Something like this:

        ; EDX -> vertex 1
        ; ESI -> vertex 2
        ; EDI -> vertex 3

        mov EAX, [EDX+vertex_y]
        cmp EAX, [ESI+vertex_y]
        jle short @@sorta
        xchg EDX, ESI
@@sorta:
        mov EAX, [EDX+vertex_y]
        cmp EAX, [EDI+vertex_y]
        jle short @@sortb
        xchg EDX, EDI
@@sortb:
        mov EAX, [ESI+vertex_y]
        cmp EAX, [EDI+vertex_y]
        jle short @@sortc
        xchg ESI, EDI
@@sortc:
        ; EDX -> topmost vertex
        ; ESI -> middle vertex
        ; EDI -> bottom vertex

The following two equations need only be calculated once for all the
constant deltas in the triangle. Skip the triangle if y3 == y1, i.e.
if the triangle has zero height.
The width can be either positive or negative depending on which side
the x2,y2 vertex is on. This is useful information when sorting out
which is the left and which is the right side of the triangle.

                (y2-y1) << 16
        temp = ---------------
                    y3-y1

        width = temp * (x3-x1) + ((x1-x2) << 16)

The equation below is used to calculate the delta for a variable that
should be interpolated over the triangle, e.g. texture u. Beware of
the denominator in this equation! Make sure it won't cause a divide
overflow in case the width is less than one pixel. (Remember that
width is a 16:16 bit fixed point number.) Note the shift by 10 in the
equation. This is because p1,p2,p3 have a 6 bit fractional part. The
resulting delta p is a 16:16 bit number.

                   ( temp * (p3-p1) + ((p1-p2) << 16) ) << 10
        delta p = --------------------------------------------
                                     width

So for a texture mapper where we have 2 variables (u,v) to interpolate
over the triangle, we have a total of 3 divs and 3 muls to calculate
the needed deltas.

Here is another equation that can be used to calculate the deltas. It
was posted to the newsgroup comp.graphics.algorithms the other day by
Mark Pursey:

    There is a cleaner way, which doesn't rely on finding the widest
    line:

    A-B-C: a triangle with screen x and y components, as well as t, a
    value which could represent lighting, texture coordinates etc.
    The following equation gives you the increment for t per
    horizontal pixel:

                 (At-Ct)*(By-Cy) - (Bt-Ct)*(Ay-Cy)
        dt/dx = -----------------------------------
                 (Ax-Cx)*(By-Cy) - (Bx-Cx)*(Ay-Cy)

The denominator is reusable for both u and v. This makes a total of 6
muls and 2 divs. However, I do not know if this works, as I don't know
the theory behind it (it looks like some vector algebra) and I haven't
tested it myself. Feel free to try it out.


Special case code
-----------------

It often pays off to make special case code that takes care of the
delta calculations when the triangle is 1, 2 or 4 pixels high. Then
you can skip the divs and use shifts instead.
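As a sanity check, the delta equations above can be written out in C.
This is only an illustrative sketch under the document's assumptions
(integer screen coordinates, 26:6 fixed point p values, 16:16 fixed
point results); the function names are mine, and 64-bit intermediates
stand in for the 32x32 multiplies and 64/32 divides you would get from
imul/idiv. The caller is still responsible for skipping zero-height
triangles and guarding against a too-small width.

```c
#include <stdint.h>

/* temp and width need only be computed once per triangle.
   Caller must ensure y3 != y1 and that |width| is not tiny. */
static int32_t tri_temp(int y1, int y2, int y3)
{
    return (int32_t)(((int64_t)(y2 - y1) << 16) / (y3 - y1));   /* 16:16 */
}

static int32_t tri_width(int32_t temp, int x1, int x2, int x3)
{
    /* 16:16 * int + (int << 16) -> 16:16, sign tells which side x2 is on */
    return (int32_t)((int64_t)temp * (x3 - x1) + ((int64_t)(x1 - x2) << 16));
}

/* One mul + one div per interpolated variable (u, v, ...).
   p1, p2, p3 are 26:6 fixed point; the result is 16:16. */
static int32_t tri_delta(int32_t temp, int32_t width,
                         int32_t p1, int32_t p2, int32_t p3)
{
    int64_t num = ((int64_t)temp * (p3 - p1) +
                   ((int64_t)(p1 - p2) << 16)) << 10;
    return (int32_t)(num / width);
}
```

For example, for a triangle (0,0)-(100,50)-(0,100), temp comes out as
0.5 (32768), and a variable going from 0 on the long edge to 100 at
x2,y2 gets a delta of exactly 1.0 (65536) per pixel, as you would
expect over a 100 pixel wide scanline.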
This is when you are calculating the deltas to interpolate along the
edges of the triangle.

I once made a histogram of the length of each scanline in the very
popular chrmface.3ds object. This object has about 850 triangles and
was scaled up so it just touched the top and the bottom of a 320x200
pixel screen. The histogram showed that most scanlines were only 1 or
2 pixels wide. This shows that the outer loop is just as important as
the inner loop, and also that it might be a good idea to have special
case code for those 1 or 2 pixel lines.

        width   number of scanlines

          1     *********************
          2     ******************
          3     **********
          4     ******
          5     ***
          6     **
          7     **


Clipping
--------

Clipping is most of the time a real pain in the ass to implement. It
will always mess up a nice looking routine with extra junk. One
possibility is to have two separate functions, one with clipping and
one without, and to test whether the triangle needs clipping before
calling either of them.

The actual clipping code is not that difficult to implement, really.
Say you need to clip a texture mapped scanline. First get the number
of pixels to skip at the end of the scanline and the number of pixels
to skip at the beginning. Then subtract the number of skipped pixels
from the original scanline width. If you skipped pixels at the start
of the scanline, a new starting u and v must be calculated. This is
done by multiplying the number of skipped pixels by delta u and
delta v respectively, and adding the original starting u and v, of
course. The following code is what I'm using to sort out the stuff:

        movsx EBP, word ptr [left_x+2]  ; Get the integer part from the
        movsx ECX, word ptr [right_x+2] ; 16:16 bit numbers.
        mov EDX, EBP
        sub EDX, ECX            ; EDX = width of scanline
                                ; ECX = x1
                                ; EBP = x2
        mov EBX, EDX

        sub EBP, [_RightClip]
        jle short @@rightok
        sub EDX, EBP            ; skip pixels at end
@@rightok:

        xor EAX, EAX
        cmp ECX, [_LeftClip]
        jge short @@leftok
        mov EAX, [_LeftClip]
        sub EAX, ECX
        mov ECX, [_LeftClip]
@@leftok:

        sub EDX, EAX            ; skip pixels at start
        jle @@notvisible

        ; EAX = pixels skipped at start
        ; ECX = clipped x1
        ; EDX = clipped width of scanline

So now you just have to multiply EAX by delta u and add the original u
to get the clipped u. The same applies for v.


The data cache
--------------

Although it is fun optimizing inner loops, there are other important
factors one should look at. With today's processors the cache aspects
are very important, maybe more important than the speed of the inner
loop. I don't know how long this will stay true, though, as the caches
are getting smarter and bigger. How about integrating a 1MB cache on
the same chip as the CPU, hehe?

I am no expert on how the cache works, but I think this is the general
idea: When the CPU has decoded an instruction that needs to get a
variable from memory, it first checks the cache to see if the variable
is already there. If it is, the CPU reads the variable from the cache.
This is called a cache hit. If the variable is not in the cache, the
CPU first has to wait for the data to be read from RAM (or the
secondary cache) into the cache, and only after that can it get the
variable from the cache. The cache always loads a full cacheline at a
time, so this will take a few clock ticks. A cacheline is 16 bytes on
a 486 and 32 bytes on a Pentium.

The advantage of this is that when reading byte after byte from
memory, the data will most of the time already be loaded into the
cache, because we accessed the same cacheline just before. But in the
case of a texture mapper, where we might be reading pixels in a
vertical line in the bitmap, the inner loop will be accessing pixels
that are >256 bytes apart.
The CPU will then be busy filling cachelines for each pixel. A 64k
bitmap won't fit inside an 8k cache, you know. So what can we do?
Well, we can buy a bigger cache, or we might consider storing our
bitmaps some other way.

I got an interesting tip from Otto Chrons the other day about this. He
said that one should store the bitmap as a set of tiles, say 8 x 8
pixels, instead of the usual 256 x 256 pixels. This would mean that a
small part of the bitmap (8 x 4 pixels) would fit in the same 32 byte
cacheline. This way, new cachelines don't need to be loaded as often
when reading pixels in a vertical line in the bitmap. The inner loop
of this mapper will be pretty slow (or maybe not, actually), but the
speed gained from the better cache usage could possibly make it faster
than any other type of texture mapper.


The code cache
--------------

The cool thing about the Pentium is that it can execute two
instructions in parallel. This is called instruction pairing. But
there are a lot of rules that must be fulfilled for the pairing to
take place. One rule is that both instructions must already be in the
code cache. This means that the first time through an inner loop, no
instructions will pair. There is one exception to this rule: if the
first instruction is a 1 byte instruction, e.g. inc eax, then they
will pair the first time.

So a completely unrolled loop that is selfmodifying is a pretty bad
idea on the Pentium. On the other hand, we are not modifying the loop
for each scanline, so chances are that parts of it will be in the code
cache from drawing the previous scanline.


Some pairing rules
------------------

First of all, both instructions must be simple instructions, i.e.
mov, inc, dec, add, sub, and, or, xor, cmp, test, jcc, ...(?)
Instructions like neg, xchg, idiv, imul will never pair. Another rule
is that the second instruction can't depend on the first. Here are
some examples:

        add ecx, eax            ; store result in ecx
        add edx, ecx            ; get result from ecx. No pairing!
        mov ecx, eax
        mov edx, ecx            ; No pairing!

        mov al, bh              ; al and ah are in the same register
        mov ah, ch              ; No pairing!

        mov ecx, eax
        mov edx, eax            ; Pairs ok.

        mov ecx, eax
        add eax, edx            ; Pairs ok.

There are two exceptions to this rule, namely the flag register and
the stack pointer. Intel has been kind enough to optimize these for
us.

        dec ecx                 ; modifies the flag register
        jnz @@inner             ; Pairs ok.

        push eax                ; both instructions are accessing esp
        push ebx                ; Pairs ok.

Take for example the loop we used to calculate the lookup table: all
instructions are simple and none depends on the previous one. The 8
instructions should execute in 4 clock ticks.

@@mklookup:
        mov eax, ecx
        add ecx, edi            ; Pairs ok.
        mov al, bh
        add ebx, esi            ; Pairs ok.
        mov [edx], eax
        add edx, 4              ; Pairs ok.
        dec ebp
        jnz @@mklookup          ; Pairs ok.

Another thing one should watch out for is the effective address
calculation, i.e. what the CPU must do when you use indexes to access
data in memory. You must not write to the index register just before
you access the data. Actually, there must be a few instructions
between the write and the access.

        add esi, ebx            ; Move the array pointer.
        inc edi
        mov eax, [esi+8]        ; Ouch! You just modified esi.

        add esi, ebx            ; Move the array pointer.
        add ebx, ecx            ; Do something useful here
        inc edi                 ; while waiting for the
        add ebp, 4              ; address to be resolved.
        mov eax, [esi+8]        ; Now it's OK to access the data. Pairs ok.

If you don't have any useful instructions to fill the gap with, you
can try to swap the two instructions so that you access the data first
and then modify the index register.

        mov eax, [esi+8]
        add esi, ebx            ; Pairs ok.

There are a lot more rules one must follow, so I suggest you buy a
good book on the subject if you are really interested in this.


Branch prediction
-----------------

Just a few closing words about the branch prediction logic on the
Pentium. The Pentium has some sort of lookup table in which it stores
the last 100 (?) jumps it has taken.
In other words, it remembers for the last conditional jumps whether it
took them or not. So when getting close to a jcc (conditional jump)
instruction, it will, at an early stage of the instruction pipeline,
check the lookup table to see if it jumped the last time or not. If it
took the jump the last time, it will assume that it should take the
jump this time also. The CPU will then start filling the instruction
pipeline with instructions from where the jump instruction is
pointing. Then, when the actual jump instruction arrives at the
execution stage, the pipeline is already filled with the right
instructions, so there will be no delay.

On the other hand, if the condition has changed this time and we do
not take the jump, the pipeline will be filled with completely wrong
instructions. Ouch! Now the pipeline must be filled with the right
instructions, which in this case are those following right after the
jcc instruction. This will take a few clock ticks.

So you should avoid inner loops that jump differently at close
intervals, like the following:

        jmp @@inner

@@extra:
        ....                    ; Do something extra when we get carry
        dec ebp                 ; overflow
        jz @@done
@@inner:
        ....                    ; Do something useful here
        add eax, ebx
        jc @@extra              ; Jump on carry overflow
        dec ebp
        jnz @@inner
@@done:

In this loop it is the 'jc @@extra' instruction that will mess up the
branch prediction. Sometimes the jump will be taken and sometimes not.
The typical way of doing masking with compares and jumps has this
problem also.


Credits and greetings
---------------------

        Nix / Logic Design
        Otto Chrons a.k.a. OCoke
        jmagic  (Can I send greetings to jmagic? Well I'm doin it :)
        MacFeenix
        BX
        thefear
        FatPhil
        LoZEr
        Addict, Damac, Dice, Doom, Swallow / Doomsday

--
 --=(__Mats_Byggmastar____a.k.a._MRI__3D-coder__)=--
 --=(__email_mri@penti.sit.fi_(mri@surfnet.fi)__)=--
 --=(__www______________http://www.sit.fi/~mri__)=--