
[Medium]Switching to an external/more powerful random number generator #184

Closed
Feynstein opened this issue May 8, 2020 · 18 comments

@Feynstein

Feynstein commented May 8, 2020

This weekend I will try to branch from your project in order to change the random number generation algorithms, which seem to be a bit outdated. I will use this library: https://gitlab.cern.ch/CLHEP/CLHEP
and add it as a submodule to your git, and make sure the algorithm choice is a parameter. I might also add a Boost program_options part to deal with parameters... I haven't had time to check how they are handled. This might mean removing the Rand.cpp file and related methods, which would make the code more modular and improve its overall quality. It might also positively affect performance, since this library has been going for a while and has nice optimizations. I don't know which C++ standard they use, so it might also mean moving up to C++11 or later if the project is not already there.

@weshinsley
Collaborator

weshinsley commented May 8, 2020

Thank you for opening this, and for your comments elsewhere. We appreciate your time and effort, as well as the things you are exploring in the code.

RANLIB has been around for a while, to be sure, and a review of our use of it is welcome. It may be that current practice in your field is ahead of ours.

The important things for us about any RNG would be (1) speed, as we make heavy use of sampling, (2) a thread-safe RNG-per-thread, one way or another, and (3) repeatability: we want to generate deterministic results from a given seed. There are always discussions about "how random is random", especially with different seeds per thread, and whether we can improve on RANLIB in that respect is a good question.

We then would typically do large numbers (10,000s) of runs with different seeds to look at the ensemble properties in terms of noise/stochasticity.
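Requirements (2) and (3) together usually come down to deriving one independent, reproducible stream per thread from a single master seed. A minimal sketch of that idea (this is not the project's code; the engine choice std::mt19937_64 and the helper name are purely illustrative):

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Derive one independently seeded engine per thread from a master seed,
// so a given master seed reproduces the same per-thread streams
// regardless of scheduling.
std::vector<std::mt19937_64> make_thread_engines(std::uint64_t master_seed,
                                                 int num_threads)
{
    std::vector<std::mt19937_64> engines;
    engines.reserve(num_threads);
    std::seed_seq seq{master_seed};
    std::vector<std::uint32_t> seeds(2 * num_threads);
    seq.generate(seeds.begin(), seeds.end());   // deterministic per-thread seeds
    for (int t = 0; t < num_threads; ++t)
        engines.emplace_back((std::uint64_t(seeds[2 * t]) << 32) | seeds[2 * t + 1]);
    return engines;
}
```

Each thread then samples only from its own engine, so no synchronisation is needed and runs with the same master seed are bitwise repeatable.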

@Feynstein
Author

Nice, thank you for your comment. I knew there were real people behind this :-)
10,000 runs are a piece of cake; in particle physics we do a few million events in order to get better accuracy, though that might not be the same thing in this case. And yes, I saw your Rand lib and it's pretty stocked with nested for loops, which is not something you want for speed. I can try a first OpenMP run on the existing one, but I think that scrapping it completely might be a better idea; CLHEP is a very powerful tool.

@Feynstein
Author

Oh, and btw, can you make sure I get a few thumbs up in the other issue about the original source :P We need to show these guys the reality behind all this. I knew it was something with grad students piling on new features and never actually refactoring... this is something I've seen too often in academia XD

@Feynstein
Author

@weshinsley Oh yes, this library is used in multi-threaded, single-seed Monte Carlo high-energy physics code.

@Feynstein Feynstein changed the title Switching to an external/more powerful random number generator [CS-1][Medium]Switching to an external/more powerful random number generator May 8, 2020
@Feynstein Feynstein changed the title [CS-1][Medium]Switching to an external/more powerful random number generator [Medium]Switching to an external/more powerful random number generator May 8, 2020
@weatherhead99

Why not use e.g. std::binomial_distribution, std::poisson_distribution, etc.? They look like a step up from the current implementation and a much simpler switch than going to CLHEP (which for large runs does make sense).
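For illustration, a sketch of how those std:: distributions are used (the helper below is hypothetical, not from the model): each distribution is an object holding its parameters, and it draws from an engine that you seed explicitly, so a given seed reproduces the same stream.

```cpp
#include <random>
#include <vector>

// Draw n samples from Poisson(mean) using an explicitly seeded engine.
// The distribution object holds the parameters; the engine holds the
// stream state, so the same seed always yields the same sequence.
std::vector<int> draw_poisson(unsigned seed, double mean, int n)
{
    std::mt19937 engine(seed);                  // fixed seed => deterministic
    std::poisson_distribution<int> pois(mean);
    std::vector<int> out;
    out.reserve(n);
    for (int i = 0; i < n; ++i)
        out.push_back(pois(engine));
    return out;
}
```

One caveat worth knowing: the standard pins down the distributions' statistical properties but not their algorithms, so the exact sequence for a given seed can differ between standard-library implementations, which matters for cross-platform repeatability.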

@Feynstein
Author

Feynstein commented May 8, 2020

I can't tell from the std::poisson_distribution reference page whether they're static functions or not. I'm not sure if they hold state? The goal here is repeatability, so they need to hold their state. But I'm not familiar with them. It also seems to me that there are plenty of occurrences of their use here and there in the code... And I think the minimum run takes 20GB of RAM, so in my mind all runs were big... But it's worth looking into; they might be more optimized. There are no references in the examples about seeding... Ohh... ok, I see it now... well, nice!

@weshinsley
Collaborator

weshinsley commented May 9, 2020

I raised concerns about the std:: stats library a while back (https://stackoverflow.com/questions/34977728/c-gamma-distribution-returning-infinity/34981326#34981326), and trying to get a fix was a tedious and uncertain process; I got the "we can't reproduce this" response for a while, later an acknowledgement that it was being looked into, but never a clear answer on whether it was fixed or not.

That wasn't in the context of this model, but there was no perceived benefit at the time in changing the RNGs, and as a result any thoughts of moving to std:: would have been treated with increasing caution.

Relying on open-source libraries we can include in full, and that are demonstrably very close to the maths, has considerable value, and I think it remains to be shown whether there is a driving need for anything external or more powerful than the simple, classical generator we are using.

But that is what the open-source process is about, so we appreciate this discussion.

@Feynstein
Author

Feynstein commented May 9, 2020

@weshinsley I was looking into CLHEP more closely and remembered how elegantly the units were implemented in Geant4. There is also a whole lot more there than the random number generators. I think that at some point some of your more mathematical phenomena could be described by linear-algebra constructs like fields or matrices, or others more appropriate for the specific job at hand, and that is very rigorously implemented in CLHEP...

@weatherhead99

@weshinsley Just to ask whether you're aware of a reproduction of this result in any compiler more recent than VS2010? Is there a bug ID for it?
Also, note that all the standard libraries are now open source. I think the advantage I'd perceive is that you could drop a lot of custom RNG code, which, on reading through it, is very clearly not thread safe.

@StevenWhitehouse
Contributor

I just did a very quick test to compare the ranf routine in this project with the generator recommended in Numerical Recipes (3rd ed., p. 342), which they call Ran. That generator appears to run at around the same speed as ranf:

[steve@fogou test]$ ./rtest
ranf 0.896546 secs
Ran 0.893404 secs

In both cases, that's the time taken to generate 100,000,000 pseudo-random doubles in a uniform distribution between 0 and 1 when run on my laptop.

On page 354 of Numerical Recipes they give the code for a generator which they claim is faster if you are specifically after floats between 0 and 1 only (the Ran generator uses mostly integer maths and can generate pseudo-random integers too). This second generator appears not to be as good as Ran, though. I don't know whether it would make sense here, since I'm not quite sure what counts as "good enough" in this case, so I've not gone the additional step of testing how much faster it might be.

I'm happy to share my test code if it is of help, but really all I've done is use the routine from the book. Note that earlier editions of Numerical Recipes recommend different random number generators, so if you have an older edition, that will be different code.

Hopefully that is a useful data point.

@weshinsley
Collaborator

@weatherhead99 - as reported in the stackoverflow issue, that was with 2016 compilers. There is no bug ID because there was no public issue tracker for std:: at that time, only email support, which also explains my caution. But it is not helpful to suggest the code is "clearly not thread safe". It is completely thread-safe in all our Intel Inspector tests, and the code makes sense to us. Submit an issue if you have a counter-example.

@StevenWhitehouse - thanks, but beware the licensing of Numerical Recipes. Although we all have the book on our shelves, the licensing is not at all simple, and it restricts the open-source usage we want. In any case, there is no strong motivation for us to change what we use.

@StevenWhitehouse
Contributor

@weshinsley Yes, licensing can be an issue, although I'm sure we could work around that by finding a different but similar implementation if it were worthwhile. In this case, though, based on the results I've got, I suspect any gains in that direction would be minimal. There is also a potentially significant investment of time in checking that a new RNG meets the requirements, which is not worth it without a significant speed gain to offset it.

So next time I find a spare moment, I'll start looking at the next level up, i.e. the consumers of the random numbers, and see if I spot anything there. At the moment I'm just spending a little time checking things that I hope might be useful to look at, but if you have a todo list or something like that, it would be a helpful guide.

@StevenWhitehouse
Contributor

Some more random thoughts...

I moved my investigation one step up the stack to look at the callers of ranf(), and there are a large number of them. So I started with the callers in Rand.cpp, which generate various distributions. There does seem to be a lot of code duplication there; in fact ranf and ranf_mt are almost identical, so there is scope for some clean-up.

I also noticed what I think (but have not yet fully investigated) is duplication between snorm and gen_norm, which appear to be different routines for generating the same distribution. Although gen_norm takes the usual mean and sd parameters, they are the constants 0 and 1 in all callers, but the compiler should sort that out for us.

So the next question was how fast those routines are: gen_norm uses the Marsaglia polar method, and snorm looks like a rejection method that's been translated from Fortran. I've not yet verified that both are doing the same thing, but the results from a quick test of their relative performance are interesting:

[steve@fogou test]$ perf record ./ntest
snorm 2.568932 secs
gen_norm 4.037334 secs

So snorm seems much faster than gen_norm, but is it? The answer turns out to be no, since gen_norm can actually generate two values per call but currently doesn't. The net result is that it could be twice as fast as it is at the moment, so that seems worth looking into.

An initial test added a static variable to store the second value and return it on every other call, after which the performance looked like this:

[steve@fogou test]$ ./ntest
snorm 2.608799 secs
gen_norm 2.110180 secs

So almost a 2x gain in performance. That isn't thread safe, of course, but it shows that gen_norm can trivially be sped up. So the next step was to make it thread safe. I looked at ranf and ranf_mt to see how that had been done, and the answer seemed to be that it was done badly. The Xcg1 and Xcg2 arrays are much larger than actually required: 512 bytes per thread, because CACHE_LINE_SIZE, which is already in bytes, was being multiplied by sizeof(long). Also, Xcg1 and Xcg2 for the same thread definitely do want to be in the same cache line, so keeping them in separate arrays makes no sense at all.

So with all that in mind, I came up with a patch, which I'll now attempt to copy & paste into here :-)

You'll likely not be able to apply this if you copy & paste it from here, so I'll email it to anybody who wants it. It should give a rough idea of what I'm thinking, though. Benefits are:

  • nearly 2x faster gen_norm
  • reduced code duplication
  • reduced size of per thread data
  • Xcg1, Xcg2 and ncache in the same structure for locality
  • concept could be extended to give additional clean up of some of the other functions in Rand.cpp (I've not looked at them yet)

Caveat: it needs careful review and testing. Also, I don't dare replace snorm calls with gen_norm until I'm 100% sure that they are the same, but there would be a small additional performance benefit if that were possible. Hopefully this is useful info :-)


diff --git a/src/Rand.cpp b/src/Rand.cpp
index c0fa05e..3a47d86 100644
--- a/src/Rand.cpp
+++ b/src/Rand.cpp
@@ -8,13 +8,22 @@
 #ifdef _OPENMP
 #include <omp.h>
 #endif
+
 /* RANDLIB macros*/
 #define ABS(x) ((x) >= 0 ? (x) : -(x))
 #define minF(a,b) ((a) <= (b) ? (a) : (b))
 #define maxF(a,b) ((a) >= (b) ? (a) : (b))
 
+struct rngstate {
+       long Xcg1;
+       long Xcg2;
+       double ncache;
+       unsigned char __pad[CACHE_LINE_SIZE-2*sizeof(long)-sizeof(double)];
+};
+
+static struct rngstate rngstate[MAX_NUM_THREADS];
+
 /* RANDLIB static variables */
-long* Xcg1, *Xcg2;
 int **SamplingQueue = nullptr;
 
 ///////////// ********* ///////////// ********* ///////////// ********* ///////////// ********* ///////////// ********* ///////////// ********* 
@@ -23,44 +32,30 @@ int **SamplingQueue = nullptr;
 
 double ranf(void)
 {
-       long k, s1, s2, z;
-       unsigned int curntg;
+       unsigned int tn;
  #ifdef _OPENMP
-       curntg = CACHE_LINE_SIZE * omp_get_thread_num();
+       tn = omp_get_thread_num();
 #else
-       curntg = 0;
+       tn = 0;
 #endif
-       s1 = Xcg1[curntg];
-       s2 = Xcg2[curntg];
-       k = s1 / 53668L;
-       s1 = Xa1 * (s1 - k * 53668L) - k * 12211;
-       if (s1 < 0) s1 += Xm1;
-       k = s2 / 52774L;
-       s2 = Xa2 * (s2 - k * 52774L) - k * 3791;
-       if (s2 < 0) s2 += Xm2;
-       Xcg1[curntg] = s1;
-       Xcg2[curntg] = s2;
-       z = s1 - s2;
-       if (z < 1) z += (Xm1 - 1);
-       return ((double)z) / Xm1;
+       return ranf_mt(tn);
 }
+
 double ranf_mt(int tn)
 {
        long k, s1, s2, z;
-       int curntg;
 
-       curntg = CACHE_LINE_SIZE * tn;
-       s1 = Xcg1[curntg];
-       s2 = Xcg2[curntg];
+       s1 = rngstate[tn].Xcg1;
+       s2 = rngstate[tn].Xcg2;
        k = s1 / 53668L;
        s1 = Xa1 * (s1 - k * 53668L) - k * 12211;
        if (s1 < 0) s1 += Xm1;
        k = s2 / 52774L;
        s2 = Xa2 * (s2 - k * 52774L) - k * 3791;
        if (s2 < 0) s2 += Xm2;
-       Xcg1[curntg] = s1;
-       Xcg2[curntg] = s2;
+       rngstate[tn].Xcg1 = s1;
+       rngstate[tn].Xcg2 = s2;
       z = s1 - s2;
        if (z < 1) z += (Xm1 - 1);
        return ((double)z) / Xm1;
@@ -92,8 +87,8 @@ void setall(long *pseed1, long *pseed2)
        long iseed2 = *pseed2;
 
        for (g = 0; g < MAX_NUM_THREADS; g++) {
-               *(Xcg1 + g * CACHE_LINE_SIZE) = iseed1 = mltmod(Xa1vw, iseed1, Xm1);
-               *(Xcg2 + g * CACHE_LINE_SIZE) = iseed2 = mltmod(Xa2vw, iseed2, Xm2);
+               rngstate[g].Xcg1 = iseed1 = mltmod(Xa1vw, iseed1, Xm1);
+               rngstate[g].Xcg2 = iseed2 = mltmod(Xa2vw, iseed2, Xm2);
        }
 
        *pseed1 = iseed1;
@@ -2042,27 +2037,16 @@ S140:
  */
 double gen_norm(double mu, double sd)
 {
-       double u, v, x, S;
+       unsigned int tn;
 
-       do
-       {
-               u = 2 * ranf() - 1; //u and v are uniform random numbers on the interval [-1,1]
-               v = 2 * ranf() - 1;
-
-               //calculate S=U^2+V^2
-               S = u * u + v * v;
-       } while (S >= 1 || S == 0);
-
-       //calculate x,y - both of which are normally distributed
-       x = u * sqrt((-2 * log(S)) / S);
-
-       // This routine can be accelerated by storing the second normal value
-       // and using it for the next call.
-       // y = v * sqrt((-2 * log(S)) / S);
-
-       //return x
-       return x * sd + mu;
+#ifdef _OPENMP
+       tn = omp_get_thread_num();
+#else
+       tn = 0;
+#endif
+       return gen_norm_mt(mu, sd, tn);
 }
+
/*function gen_snorm
  * purpose: my own implementation of sampling from a uniform distribution, using Marsaglia polar method, but for multi-threading
  *
@@ -2071,6 +2055,13 @@ double gen_norm(double mu, double sd)
 double gen_norm_mt(double mu, double sd, int tn)
 {
        double u, v, x, S;
+       double t;
+
+       if (rngstate[tn].ncache) {
+               t = rngstate[tn].ncache;
+               rngstate[tn].ncache = 0.0;
+               return t * sd + mu;
+       }
 
        do
        {
@@ -2082,15 +2073,17 @@ double gen_norm_mt(double mu, double sd, int tn)
        } while (S >= 1 || S == 0);
 
        //calculate x,y - both of which are normally distributed
-       x = u * sqrt((-2 * log(S)) / S);
+       t = sqrt((-2 * log(S)) / S);
+       x = u * t;
 
        // This routine can be accelerated by storing the second normal value
        // and using it for the next call.
        //      y = v * sqrt((-2 * log(S)) / S);
+       rngstate[tn].ncache = v * t;
 
-       //return x
        return x * sd + mu;
 }
+
 /*function gen_gamma
  * purpose: my own implementation of sampling from a gamma distribution, using Marsaglia-Tsang method
  *
diff --git a/src/Rand.h b/src/Rand.h
index 927e48f..ad2974d 100644
--- a/src/Rand.h
+++ b/src/Rand.h
@@ -11,7 +11,6 @@
 
 /* RANDLIB global variables */
 extern int **SamplingQueue;
-extern long* Xcg1, *Xcg2;
 /* RANDLIB functions */
 long ignbin(long, double);
 long ignpoi(double);
diff --git a/src/SetupModel.cpp b/src/SetupModel.cpp
index ce6f9f1..3b4661b 100644
--- a/src/SetupModel.cpp
+++ b/src/SetupModel.cpp
@@ -32,8 +32,6 @@ void SetupModel(char* DensityFile, char* NetworkFile, char* SchoolFile, char* Re
        char buf[2048];
        FILE* dat;
 
-       if (!(Xcg1 = (long*)malloc(MAX_NUM_THREADS * CACHE_LINE_SIZE * sizeof(long)))) ERR_CRITICAL("Unable to allocate ranf storage\n");
-       if (!(Xcg2 = (long*)malloc(MAX_NUM_THREADS * CACHE_LINE_SIZE * sizeof(long)))) ERR_CRITICAL("Unable to allocate ranf storage\n");
        P.nextSetupSeed1 = P.setupSeed1;
        P.nextSetupSeed2 = P.setupSeed2;
        setall(&P.nextSetupSeed1, &P.nextSetupSeed2);
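As an aside on the hand-computed `__pad` member in the patch above: on a C++11 compiler, alignas can make the compiler do the padding. A sketch (the 64-byte figure is an assumption here; the project's real CACHE_LINE_SIZE may differ):

```cpp
#include <cstddef>

constexpr std::size_t kCacheLineSize = 64;  // assumed; use the project's CACHE_LINE_SIZE

// alignas on the struct forces sizeof to be a whole multiple of the
// alignment, so each thread's state occupies its own cache line(s)
// without any manual padding arithmetic.
struct alignas(kCacheLineSize) RngState {
    long Xcg1;
    long Xcg2;
    double ncache;
};

static_assert(sizeof(RngState) % kCacheLineSize == 0,
              "per-thread state is padded to whole cache lines");
```

This avoids the risk of the padding expression going negative if members are added later, since the compiler recomputes it.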

@weatherhead99

@weshinsley For example, the function SampleWithoutReplacement() is not thread safe: it uses a global pointer which neither has thread-local storage nor is guarded by any synchronisation primitives.

Lack of thread safety does not imply the appearance of non-deterministic thread-related errors such as would be reported by a tool like Inspector. As far as I can see, that particular example can't result in a deadlock or a control-flow race hazard, merely possibly a data race.

@StevenWhitehouse
Contributor

A random history...

When trying to figure out how to improve something, it is always good to know where you are starting from, so with that in mind I've tracked down some information about the current random number generator. I have some further details that I hope to post in due course, but this seems to be a self-contained topic, and maybe it will be useful to others looking at this.

The random number generator currently in use as ranf was originally described in this paper:
http://www-labs.iro.umontreal.ca/~lecuyer/myftp/papers/cacm88.pdf
which was published in CACM in June 1988. Fig. 3 on page 747 should look very familiar. L'Ecuyer then published a second paper, referenced in the code (which helps!), relating to how to set the initial seeds:
https://dl.acm.org/doi/pdf/10.1145/103147.103158

L'Ecuyer's random number generator uses two independent LCGs chosen to give a long period; the text claims circa 2.3e18. That should be more than adequate for this situation, since the test data uses (I measured this) approx 1.7e9 random numbers for the longest simulation. So no concerns there.

L'Ecuyer's generator is quite obviously popular: a quick Google search for some of the constant values reveals a number of other users, and the one that tops the list (for me, anyway) is gnuplot. Since we mentioned Numerical Recipes above, it is worth noting that this generator appears in their 2nd ed. (on p. 273 of my Fortran copy; I gave my 2nd ed. C copy away when I bought the 3rd ed.!) in combination with a Bays-Durham shuffle, added to break up any serial correlations in the output, implying they were concerned that the basic algorithm was not good enough. That algorithm was named ran2.

By the time of the 3rd ed. of Numerical Recipes, they warn about the perils of RNGs which rely solely on LCGs, due to the generated numbers lying on a limited number of hyperplanes (if one considers vectors of N consecutive random numbers as points in N-dimensional space). Their newly recommended generator (Ran) combines a 64-bit LCG (post-processed with a simple hash function) with a 64-bit xorshift generator (see this paper by Marsaglia: https://www.jstatsoft.org/article/view/v011i05) and an MWC, all combined at the output.

It is worth noting here that the ranf generator only produces 32 bits of output, so although my quick test above suggested it was the same speed as Ran, Ran is effectively twice as fast, since one could split its 64 bits of output into two 32-bit words (not something you'd want to do with an LCG-only generator!).

There are clearly things that could be done to improve the ranf function, but whether it would make any difference to the quality of the results depends on how those numbers are used, and I don't have enough understanding of the next stage up the stack to be sure. Based on my research so far, though, it probably would make sense to change to something more modern. Probably a 64-bit LCG plus output hash, in parallel with an xorshift, would be enough. There is no need to use any of the NR code, since there are plenty of other references to these techniques and the implementation is straightforward.
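A sketch of that style of generator, avoiding the NR code entirely: the LCG constants are Knuth's well-known MMIX multiplier/increment, the xorshift triple (13, 7, 17) is from Marsaglia's paper, and the seeding and combining details here are illustrative choices, not a vetted design.

```cpp
#include <cstdint>

struct ComboRng {
    std::uint64_t lcg;  // 64-bit LCG state
    std::uint64_t xs;   // xorshift64 state (must be nonzero)

    explicit ComboRng(std::uint64_t seed)
        : lcg(seed ^ 0x9E3779B97F4A7C15ULL), xs(seed | 1) {}

    std::uint64_t next()
    {
        // Knuth MMIX LCG step, with a cheap xor-shift output hash
        lcg = lcg * 6364136223846793005ULL + 1442695040888963407ULL;
        std::uint64_t h = lcg ^ (lcg >> 31);
        // Marsaglia xorshift64 step
        xs ^= xs << 13;
        xs ^= xs >> 7;
        xs ^= xs << 17;
        return h + xs;  // combine the two independent streams
    }

    // 64 output bits give two 32-bit words per call, or one double in [0,1)
    double next_double() { return (next() >> 11) * (1.0 / 9007199254740992.0); }
};
```

As noted above, splitting the 64-bit output into two 32-bit words is reasonable for a combined generator like this, which is what makes it effectively twice the speed of a 32-bit-per-call routine.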

Having run the test through perf, the results suggest that however lightning-fast the RNG might be, it will have little effect on the overall speed of the simulation: approx 50% of the CPU time was spent in InfectSweep and only 2.71% in ranf in my test. The first few lines of the perf output are included in case they are useful:

Overhead Command Shared Object Symbol

........ ......... .................................... ..............................................................................................................

49.24%  CovidSim   CovidSim                              [.] InfectSweep
 5.78%  CovidSim   CovidSim                              [.] AssignPeopleToPlaces
 5.75%  CovidSim   CovidSim                              [.] IncubRecoverySweep
 5.05%  CovidSim   libm-2.30.so                          [.] __ieee754_pow_fma
 4.95%  CovidSim   CovidSim                              [.] dist2
 3.66%  CovidSim   CovidSim                              [.] numKernel
 2.71%  CovidSim   CovidSim                              [.] ranf_mt
 1.76%  CovidSim   CovidSim                              [.] DoInfect
 1.57%  CovidSim   CovidSim                              [.] CalcPersonSusc
 1.55%  CovidSim   CovidSim                              [.] ignbin_mt
 1.43%  CovidSim   CovidSim                              [.] CalcSpatialInf
 1.22%  CovidSim   CovidSim                              [.] CalcPersonInf
 1.20%  CovidSim   CovidSim                              [.] dist2UTM
 1.17%  CovidSim   CovidSim                              [.] CalcHouseInf
 1.16%  CovidSim   CovidSim                              [.] RecordSample
 1.09%  CovidSim   CovidSim                              [.] CalcPlaceInf

Anything below this line used less than 1% of the CPU time. Oh, and I also confirmed that snorm is basically the same thing as gen_norm(0, 1), having tracked down the paper about that too:
https://pdfs.semanticscholar.org/8ee8/979cc90816e6b286535d198ca93794085af5.pdf

While we are on the subject of papers, and noting also comments on SampleWithoutReplacement() above, the comment in the code claims that it is based upon the SG algorithm from this paper:
https://dl.acm.org/doi/pdf/10.1145/214392.214402

A quick inspection shows that only the "else" part of the big if statement there appears to be related to the SG algorithm, and even then it doesn't look quite the same. I've not had a chance to check in any detail though, so there may be more to it... that's probably enough work on this topic for this weekend :-)

@weshinsley
Collaborator

weshinsley commented May 20, 2020

@weatherhead99 - can you be any more specific about what you've observed? Which global pointer do you mean in SampleWithoutReplacement? The only non-local I can see is SamplingQueue[tn][...], where tn is the thread number, which makes it safe.

@StevenWhitehouse
Contributor

I've turned some of the ideas above into a pull request. It doesn't fix everything, but since this is my first pull request, I thought it was a good plan to show what I've got so far and get some feedback before going too far down the line. Hopefully it is useful :-)

@weshinsley
Collaborator

Issue clean-up, closing comments:

  • We merged a tidy-up of the existing RNG package.
  • There is no pressing feeling that the existing RNG is inadequate.
  • Moreover, the current RNG is thread-safe, we have its full source, and it is lightweight enough to perform well.
  • We don't especially need a "more powerful" RNG, and we don't want to rely on an external package.

#455 discusses the potential (I think) of making a better interface so it is easy to pick/choose RNG - I suggest any other discussion on RNGs continues there.
