From - Fri Jan 31 09:12:01 1997 Path: nntp.farm.idt.net!news.idt.net!cam-news-hub1.bbnplanet.com!news.bbnplanet.com!cpk-news-hub1.bbnplanet.com!news.maxwell.syr.edu!news-feed.inet.tele.dk!news1.tele.dk!cph-1.news.DK.net!dkuug!dknet!usenet From: Agner@login.dknet.dk (Agner Fog) Newsgroups: sci.math,sci.math.num-analysis,sci.stat.math,comp.programming,comp.lang.c++,comp.lang.asm.x86 Subject: Excellent pseudo random number generator Date: Fri, 31 Jan 1997 15:20:02 +0100 Organization: Customer at DKnet Lines: 625 Distribution: inet Message-ID: Reply-To: agner@login.dknet.dk NNTP-Posting-Host: login.dknet.dk Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Xref: nntp.farm.idt.net sci.math:11896 sci.math.num-analysis:2025 sci.stat.math:1533 comp.programming:6008 comp.lang.c++:25917 comp.lang.asm.x86:12838 [copyright (c) 1997 by Agner Fog, noncommercial use allowed] I couldn't find a random number generator that was good enough for my very high requirements, so I made one myself. The result exceeded all my expectations, and therefore I am going to share my findings with others and hear your opinions. The contents of this long posting are: 1. Introduction 2. Algorithm 3. C++ source code 4. Optimized assembly language source code 5. Theoretical analysis and discussion 6. Request for feedback 7. Conclusion 1. Introduction --------------- I need a good pseudo random number generator for some very heavy Monte Carlo simulations. My requirements are: - Floating point output with a flat distribution in the interval between 0 and 1. - Very high resolution - Extremely high cycle length - No correlation between consequtive numbers - The sequence that follows after an extreme number must be perfectly random - Fast execution Unsatisfied with the codes I could find elsewhere I developed the code that follows, which more than fulfills my requirements. 2. Algorithm ------------ My method is inspired by the widely used additive generator (ADDGEN) which has the following simple algorithm: X[n] = X[n-j] + X[n-k] The X-values are binary numbers of b bits, where the overflow from the addition is ignored. So to be exact, we have: X[n] = (X[n-j] + X[n-k]) mod 2^b This algorithm is easily implemented with a circular buffer of k b-bit numbers. The cycle length of the ADDGEN generator is far smaller than the size of the circular buffer permits because there is no flow of information from the high bits to the low bits, as explained in section 5. I have therefore improved this algorithm by shuffling the bits. The improved algorithm is: Hi(X[n]) = (Lo(X[n-j]) + Lo(X[n-k])rotr r) mod 2^(b-1) Lo(X[n]) = (Hi(X[n-j]) + Hi(X[n-k])) mod 2^(b-1) Where Hi(X) means the upper b/2 bits of X, and Lo(X) means the lower half of X. Y rotr r means the bits of Y rotated r places to the right: 00001111(bin) rotr 3 = 11100001(bin) The choise of b is determined by the desired resolution and the microprocessor word size. r must be relatively prime to b, j must be relatively prime to k, and k-j must be odd. The best choise is when k, j/2, and k-j are all primes, and j is between 0.5*k and 0.7*k. This algorithm is called SHUFFLADD in the following. The source code below uses b=64, r=7, j=10, k=17. 3. C++ source code ------------------ #include // define parameters #define KK 17 #define JJ 10 #define RR 7 typedef class { public: void RandomInit (double seed); long double Random (); protected: union { long double randp1; long randbits[3];}; int p1, p2; long randbuffer[KK+1][2]; } TRandomGenerator; long double TRandomGenerator::Random() { // returns a random number between 0 and 1. long a, b; // generate next number b = _lrotr(randbuffer[p1][0], RR) + randbuffer[p2][0]; a = randbuffer[p1][1] + randbuffer[p2][1]; randbuffer[p1][0] = a; randbuffer[p1][1] = b; // rotate list pointers if (--p1 < 0) p1 = KK - 1; if (--p2 < 0) p2 = KK - 1; // convert to float randbits[0] = a; #if sizeof(long double) == 10 randbits[1] = b | 0x80000000; #elif sizeof(long double) == 8 randbits[1] = b & 0x000FFFFF | 0x3FF00000; #else #error incompatible floating point format #endif // normalize floating point number return randp1 - 1.;} void TRandomGenerator::RandomInit (double seed) { // this function initializes the random number generator. // must be called before the first call to Random! int i, j; long double s; // get seed into an acceptable range if (seed < 0) seed = 0.147328926107 - seed; while (seed >= 1) seed *= 0.0817835148503; s = seed; // make random numbers and put them into the buffer for (i=0; i> r) | (x << (sizeof(x)-r));} 4. Optimized assembly language source code ------------------------------------------ ; The random function is optimized for the Pentium microprocessor. ; The RandomInit function does not need optimizing because it is ; only called once. ; 32-bit mode: ; ------------ .386 ; define parameters KK equ 17 ; size of circular buffer JJ equ 10 ; second pointer RR equ 7 ; rotate count RAND_DATA SEGMENT PARA USE32 PUBLIC 'DATA' align 16 randp1 DT 1.5 ; used for conversion to float DW 0 ; alignment p1 DD 0 ; pointer in circular buffer p2 DD 0 ; pointer in circular buffer rbufend DD (KK-1)*8 ; pointer to end of buffer randbuf DD (2*KK) dup(?) ; circular buffer DD ? ; receives junk one DD 1.0 ; constant RandAdd DQ 0.147328926107; arbitrary constants used by RandomInit RandFactor DQ 46.1063140045 RandDiv DQ 0.0817835148503 RAND_DATA ENDS DGROUP GROUP RAND_DATA _TEXT SEGMENT DWORD USE32 PUBLIC 'CODE' ASSUME CS:_TEXT,DS:DGROUP PUBLIC Random1,_Random1,Random2,_Random2,RandomInit,_RandomInit Random1 PROC NEAR _Random1 LABEL NEAR FLD [randp1] ; load random number FSUB [one] ; normalize Random2 LABEL NEAR ; second entry for integer output _Random2 LABEL NEAR MOV ECX, [p1] ; ring buffer pointers MOV EDX, [p2] PUSH EBX NOP ; avoid imperfect pairing due to AGI MOV EBX, [randbuf][ECX] ; shuffle-and-add algorithm MOV EAX, [randbuf][ECX+4] ROR EBX, RR ; unpaired ADD EBX, [randbuf][EDX] ADD EAX, [randbuf][EDX+4] MOV [randbuf][ECX], EAX MOV [randbuf][ECX+4], EBX OR EBX, 80000000H ; bit 63 must be set MOV DWORD PTR [randp1],EAX ; low 32 bits of next random number SUB ECX, 8 ; decrement p1 JNC SHORT R1 MOV ECX, [rbufend] R1: MOV DWORD PTR [randp1+4],EBX ; bits 32-62 of next random number POP EBX SUB EDX, 8 ; decrement p2 JNC SHORT R2 MOV EDX, [rbufend] R2: MOV [p1], ECX MOV [p2], EDX RET random1 ENDP RandomInit PROC NEAR _RandomInit LABEL NEAR FLD1 FLD QWORD PTR [ESP+4] ; seed FTST ; check if seed is > 0 FSTSW AX SAHF JNC SHORT R20 FSUBR [RandAdd] JMP SHORT R20 R10: FMUL [RandDiv] ; make seed < 1 R20: FCOM ST(1) FSTSW AX SAHF JAE R10 MOV ECX, OFFSET DS:[randbuf] ; set up for loop MOV EDX, KK*2 R30: FSQRT ; make random number FADD [RandAdd] FMUL [RandFactor] FPREM ; get decimals FST QWORD PTR [ECX] ; only lower 32 bits used, rest overwritten next time ADD ECX, 4 DEC EDX JNZ R30 FSTP ST(0) ; discard seed FSTP [randp1] ; initialize bit 64-79 MOV [p1], (KK-JJ-1)*8 ; initialize buffer pointers MOV [p2], (KK-1)*8 CALL Random2 ; prepare first random number RET 8 ; RET 0 if _cdecl calling RandomInit ENDP _TEXT ENDS END ; To optimize for PentiumPro processor: Remove the NOP and replace ; the conditional jumps to R1 and R2 with conditional moves. ; 16-bit mode (this is slower): ; ----------------------------- .386 ; define parameters KK equ 17 ; size of circular buffer JJ equ 10 ; second pointer RR equ 7 ; rotate count RAND_DATA SEGMENT PARA USE16 PUBLIC 'DATA' align 16 randp1 DT 1.5 ; used for conversion to float p1 DW 0 ; pointer in circular buffer p2 DW 0 ; pointer in circular buffer rbufend DW (KK-1)*8 ; pointer to end of buffer align 8 randbuf DD (2*KK) dup(?) ; circular buffer DD ? ; receives junk one DD 1.0 ; constant RandAdd DQ 0.147328926107; arbitrary constants used by RandomInit RandFactor DQ 46.1063140045 RandDiv DQ 0.0817835148503 RAND_DATA ENDS DGROUP GROUP RAND_DATA _TEXT SEGMENT BYTE USE16 PUBLIC 'CODE' ASSUME CS:_TEXT,DS:DGROUP PUBLIC Random1,_Random1,Random2,_Random2,RandomInit,_RandomInit Random1 PROC NEAR ; FAR if model is medium or large _Random1 LABEL NEAR FLD [randp1] ; load random number FSUB [one] ; normalize Random2 LABEL NEAR ; second entry for integer output _Random2 LABEL NEAR MOV BX, [p1] ; ring buffer pointers PUSH SI MOV SI, [p2] NOP ; avoid imperfect pairing due to AGI MOV EDX, [randbuf][BX] ; shuffle-and-add algorithm MOV EAX, [randbuf][BX+4] ROR EDX, RR ADD EDX, [randbuf][SI] ADD EAX, [randbuf][SI+4] MOV [randbuf][BX], EAX MOV [randbuf][BX+4], EDX OR EDX, 80000000H ; bit 63 must be set MOV DWORD PTR [randp1],EAX ; low 32 bits of next random number SUB SI, 8 ; decrement p2 JNC SHORT R2 MOV SI, [rbufend] R2: MOV DWORD PTR [randp1+4],EDX ; bits 32-62 of next random number SUB BX, 8 ; decrement p1 JNC SHORT R1 MOV BX, [rbufend] R1: MOV [p1], BX MOV [p2], SI POP SI RET random1 ENDP RandomInit PROC NEAR ; FAR if model is medium or large _RandomInit LABEL NEAR MOV BX, SP FLD1 FLD QWORD PTR SS:[BX-2*(TYPE RandomInit)] ; seed FTST ; check if seed is > 0 FSTSW AX SAHF JNC SHORT R20 FSUBR [RandAdd] JMP SHORT R20 R10: FMUL [RandDiv] ; make seed < 1 R20: FCOM ST(1) FSTSW AX SAHF JAE R10 MOV BX, OFFSET DS:[randbuf] ; set up for loop MOV DX, KK*2 R30: FSQRT ; make random number FADD [RandAdd] FMUL [RandFactor] FPREM ; get decimals FST QWORD PTR [BX] ; only lower 32 bits used, rest overwritten next time ADD BX, 4 DEC DX JNZ R30 FSTP ST(0) ; discard seed FSTP [randp1] ; initialize bit 64-79 MOV [p1], (KK-JJ-1)*8 ; initialize buffer pointers MOV [p2], (KK-1)*8 CALL Random2 ; prepare first random number RET 8 ; RET 0 if _cdecl calling RandomInit ENDP _TEXT ENDS END ; ----------- comment % Example of calling from C++ program: extern "C" void _stdcall RandomInit (double); extern "C" long double _stdcall Random1 (void); int i; RandomInit (time(NULL)); for (i=0; i<100; i++) { printf ("\n%14.10f", (double)Random1());} // if _stdcall not supported, use _cdecl in stead and replace // RET 8 with RET 0 in asm code. % An important design goal has been to make the code fast. Thanks to the shuffling of bits, it has been possible to dispense with the carry from the lower 32 bits to the upper 32 bits and still get a good performance. This makes it possible to do the two ADD instructions simultaneously in the two pipelines of the Pentium. This saves 2 clock cycles which more than compensates for the one clock cycle used by the (unpairable) rotate instruction. With great joy I have found that the improved algorithm actually is slightly faster than the ADDGEN algorithm! 5. Theoretical analysis and discussion -------------------------------------- The additive generator (ADDGEN) was chosen as a starting point because it is fast and because its state is defined by more than one variable which gives the potential for a high cycle length. The cycle length is defined as the number of times the random function can be called before the sequence is repeated. Obviously, the cycle length cannot exceed the number of possible states. In many random number generators it is much less than that. Since cycle lengths ideally should be too long to measure experimentally, a theoretical analysis is needed. I will start my discussion of cycle lengths with the case of a system with N possible states where it can go from one state to another in a totally chaotic fashion without following any rules, except that it is deterministic. Assume that we have called this chaotic function i times going through i different states. The probability that the next state will be one that we have seen before is i/N. We continue the process until this happens, when i=c. The most likely value of the cycle length c will be close to sqrt(2*N) because the summation of i/N will exceed 1/2 at this value. I have found that a system based on the decimals of a squareroot behaves in this manner (used in the RandomInit function above). Starting this system with different initial seeds, I found different trajectories which all sooner or later joined together and ended up into the same cycle with a length of the expected order of magnitude. It is very possible that this system has other much smaller cycles which I didn't find because the probability of getting into them was small. In the following, I am using the value of c = sqrt(2*N) as a reference for evaluating how good other systems are. Now, let's analyze the ADDGEN system. It consists of a circular buffer of k binary numbers of b bits each. The number of possible states is N = 2^(k*b). This system is based on addition operations, where the carries trickle from the low bits up through the high bits. There is no information flowing the other way, from the high bits to the low bits. For this reason, the least significant bits of the k numbers constitute an independent system which cannot be influenced by the higher bits. I will call this system ring 0, i.e. the least significant bits of the k numbers. I will now analyze ring 0 alone. The state where it contains all zeroes will go to itself because adding zeroes can produce nothing but zeroes. This gives a cycle length of one. The remaining 2^k-1 states may provide one or more cycles. The sum of the lengths of all the cycles is of course equal to the number of possible states. I have found experimentally, that the cycles either all have the same length, or there is a simple relationship between the cycle lengths. The length of the biggest cycle is always equal to the least common multiple of all other cycles in the system. Ring 1, the ring of the least significant but one bits, forms an identical system to ring 0, except for the fact that the carries from additions in ring 0 are added to ring 1. If we turn off the carries for a moment and analyze the combined system of ring 0 and ring 1, without carries, then we have two identical systems. The two rings may happen to be in different cycles, so the cycle length of the combined system should be the least common multiple of the cycle lengths of the two systems. But since the highest cycle length of a single ring is allready equal to the least common multiple of all other cycles in that ring, then the maximum cycle length of the two-ring system (without carries) can be no longer than the maximum cycle length of the single-ring system. If we allow carries from ring 0 to ring 1, then the cycle lengths will be doubled. When we finally expand the system to b rings, then we have b identical ring systems where the carries from each ring is added to the next ring in a manner where the contributions of carries from each ring are purely additive. The maximum cycle length is doubled for each extra ring that is added. Example: An ADDGEN system with k=5, j=1, b=1 has the following cycles: 3, 7, 21 (ignoring the trivial state of all zeroes). Expanding this to two rings (b=2) gives cycle lengths 3, 6, 7, 14, 21, 42. b=3 gives: 3, 6, 12, 7, 14, 28, 21, 42, 84. The conclusion is that the maximum possible cycle length of an ADDGEN system is (2^k-1)*2^(b-1) ~= 2^(k+b-1). This is much lower than sqrt(2*N) = 2^(k*b/2), due to all the regularities in the system. In order to improve this situation I have developed the SHUFFLADD algorithm defined above. I am shuffling the bits in such a way that information goes not only from the low bits to the high bits through the carries, but also the other way. In fact, I have made the design in such a way that information can go from any bit in the buffer to any other bit. This can be verified experimentally by changing a single bit anywhere in the buffer, and after a sufficient number of calls you will get a completely different sequence from what you would if you hadn't changed that single bit. No information is lost. Testing the SHUFFLADD system experimentally, I found that the cycle length usually was bigger than sqrt(2*N). The reason for this is as follows: In the chaotic system described above there is no rule preventing two different states from going to the same successor state. But the SHUFFLADD algorithm is a reversible function. Each state has one and only one precursor state. It is no problem to apply the algorithm backwards and recall previous random numbers in the sequence. (The ADDGEN algorithm is also reversible). We therefore have to modify the argument we used above to calculate the cycle length of the chaotic system. After having passed through i different states in the SHUFFLADD algorithm we want to calculate the probability that the next state will be one that we have seen before. This probability is no longer i/N, but 1/(N-i) because only our initial state can close the circle, the i-1 states in between are not possible successors. Summing this probability, we find that we most likely have to try more than N/2 times before we hit the initial state again. In other words, the cycle length is likely to be somewhat longer than N/2. (Forgive me for the laziness of not calculating the exact formula for the sum of 1/(N-i)). Now, assuming that we have found a cycle of length c1 = N/2, we have to analyze the remaining states. We can divide the set of possible states into two subsets: one subset comprised of the states in the c1 cycle, and another set containing all other states. This second subset has the same quality as the whole set of states: namely that all states in this subset must go to a state in the same subset. Searching for a cycle in the second subset the same way we did before, we will most likely find a cycle length c2 slightly larger than half the size of the subset, or N/4. Using this argument recursively, we should expect to find cycles of lengths approximately N/2, N/4, N/8, N/16, etc. down to 1. Verifying this experimentally is only possible for unrealistically small systems. For example in a SHUFFLADD system with b=6, r=1, k=4, j=1, I found the following cycle lengths: 13,053,066; 2,590,080; 562,305; 247,197; 101,212; 94,527; 90,601; 16,503; 7,485; 6,739; 3,829; 2,094; 915; 359; 288; 14; 1; 1. The total number of cycles in SHUFFLADD systems not too big to analyze experimentally was found to be 0.8*k*b on average. Plotting the binary logarithms of the cycle lengths, I found that these were randomly scattered in the interval between 0 and k*b. Unlike the ADDGEN system there was no evidence of any relationship between cycle lengths. Only very rarely did I find two cycles of the same length (except for length 1), and a prime factorization of the cycle lengths showed that they share no more prime factors than you would expect from completely random numbers. This, of course, holds only when the design rules are obeyed, i.e. r must be relatively prime to b, j must be relatively prime to k, and k-j must be odd. The only regularity I found was that there was always two cycles of length one. One of these was the obvious case where the buffer contained all zeroes. The other one was less obvious. If the distribution of cycle lengths sketched here also holds for systems too big to investigate experimentally, then we can calculate the probability of getting into an unacceptably small cycle as approximately cmin/2^(k*b-1). For k=17 and b=64 we find that the probability of getting into a cycle smaller than Cmin = 10^12 is 10^(-315). I think we can live with this risk! We must conclude that the shuffling of bits has increased the cycle length tremendously. The only problem is that the more random we make the system, the more difficult it becomes to analyze theoretically. We have made a system which seems to be much better, but we cannot prove theoretically that it is. A further design goal was to minimize the correllation between subsequent numbers in a sequence. The shuffling of bits has probably also improved the performance in this respect, although both the ADDGEN and SHUFFLADD generators have passed all my tests. 6. Request for feedback ----------------------- If there is any math genius out there who can refine my analysis then I would be grateful to hear from you. My heuristic arguments about cycle lengths rely on the assumption that the SHUFFLADD algorithm has no regularities which can reduce cycle lengths, such as the ADDGEN has. Proving this theoretically is beyond my powers. And an experimental proof is only possible for systems too small to be useful for the intended purposes. Also, if anybody can prove, theoretically or statistically, that the random numbers produced by my algorithm are less than perfectly random, then please let me know. 7. Conclusion ------------- The SHUFFLADD algorithm defined here produces pseudo random numbers with a flat distribution in the interval 0 <= X < 1. The resolution is 10^(-19). The cycle length is virtually infinite. I have found no measureable deviations from perfect randomness. The shuffling of bits also means that we do not need as big a buffer as for the ADDGEN to obtain a satisfactory performance. The optimized code executes in just 20 clock cycles on a Pentium processor. This means that it can produce 10 million random numbers per second on a 200MHz processor. This document is also published at my www site together with a list of 10000 random numbers. You will find it at: http://announce.com/agner/random This URL will be updated late this year with more stochastic functions. =============================================================================== Agner Fog, Ph.D. Pentium optimation manual at agner@login.dknet.dk http://announce.com/agner/assem