Ticket #1155 (closed enhancement: fixed)

Opened 16 years ago

Last modified 15 years ago

tango.math.Random improvement

Reported by:	fawzi	Assigned to:	kris
Priority:	major	Milestone:	1.0
Component:	Tango	Version:	trunk
Keywords:	random	Cc:

Description

Tango has two random number generators (that are actually the same underlying one): tango.math.Rand and tango.math.Random .

There is no reason for it. Rand are structures, and one also supports generation of double numbers between 1 and 0, whereas Random contains a class.

I think that using a class is the right approach, the structure might have a minor speed advantage, but it is easy to "forget" to update the original structure when passing it as argument. Due to this the sequence gets repeated, which is some cases can have very bad consequences.

To have speed it is better to use methods that set chunks of values together, those methods often can then be optimized more effectively than the ones that return the single value. A single kind of random number generator is enough if it is a good random number generator (and KISS is such a random number generator), so if one really wants more speed it is possible to set the whole class as final (at the moment most methods are final).

Something that would be nice to have is a jumpahead method to skip a given number of elements, WichmannHill? has this (L'Ecuyer 2001 has a c++ implementation but it is not free for commercial use), but probably seeding derived generators with a master KISS generator is good enough.

If one wants also to have the system random number generator (where available) probably it makes sense to have a different interface, as its throughput is much smaller it should be used differently. Otherwise if used correctly (using array access where needed) one could make the class (and methods) non final, but at the moment I see no need for it.

Something that one might still need is numbers with other distributions, but this can be added on the top of the current generator. One should be able to easily have random int,uint,float,double and reals, there is no reason to use separate structures for this.

Checkpointing/saving the state and restoring it is also very important to be able to reproduce a given program behaviour.

So I had a go at Random.d and added random float,double and reals, randomize(T[]) for different T, and checkpoint ability. I think that this makes Random much better, and makes Rand deprecated, I join a patch for it

Attachments

patchRandom.txt (15.6 kB) - added by fawzi on 06/19/08 13:50:17.: patch for random (apply with patch)
Random.d (80.3 kB) - added by fawzi on 07/05/08 08:16:49.: Complete rewrite of Random with many enhancements, added back .shared static method
random.tar (110.0 kB) - added by fawzi on 07/07/08 00:27:49.: my Random.d rewrite and splitted in several modules
random.zip (38.9 kB) - added by fawzi on 11/14/08 08:51:52.: the proposal of new random package

Change History

06/19/08 15:08:20 changed by fawzi

actually I I understood correctly tango still uses the old version of KISS, the new one (KISS99) is better, and it might be a good idea to implement it (not tested, just pasting here as reference)

 uint x = 123456789,y = 362436000,z = 521288629,c = 7654321; /* Seed variables */ 
 
uint KISS(){
   const ulong a = 698769069UL;    
   ulong t;
   x = 69069*x+12345;  
   y ^= (y<<13); y ^= (y>>17); y ^= (y<<5);  
   t = a*z+c; c = (t>>32); 
   z=cast(uint)t;
   return x+y+z;  
}

with seeding subject to the following constraints...

void init_KISS() 
{ 
   x = devrand(); 
   while (!(y = devrand())); /* y must not be zero */ 
   z = devrand(); 
 
   /* Don’t really need to seed c as well but if you really want to... */ 
   c = devrand() % 698769069; /* Should be less than 698769069 */ 
}

07/04/08 15:28:44 changed by fawzi

I did not disappear :)

As this is something that I need (I am writing an random testing framework, and I do simulations that need good Rngs) I went on an tried to write a first class random number generator.

It did need some more iterations than I expected, but I like the result very much. Modestly I think that it has the best Rng interface that I know :)

Anyway I join my full rewrite, and at the beginning of it there is a quite long description explaining usage and some design decisions.

I just repeat here the annoyances:

I have put quite some functionality in a single module, maybe some might prefer to split it up
I have added two "next" methods to RandomG for backward compatibility reasons, but I have removed .shared instance from Random, that has been replaced by the "rand" object. The idea behind this is that RandomG is a template and rand it should be shared across all templates. If the name rand is considered bad one could change it, or even reintroduce a .shared static method that returns rand.
you cannot initialize a static array directly, this because randomize is declared like this:

U randomize(U)(ref U a) { }

and a static array cannot be passed by reference. Removing the ref would make arrays initialized, and scalar not, which is much worse.

I will for sure use my module, but if it gets into tango... well all the better :)

07/07/08 00:38:37 changed by fawzi

I have splitted my rewrite in several modules that are rooted in frm: frm.random.engines.KISS frm.random.engines.Urandom frm.random.engines.ArraySource? frm.random.engines.CMWM frm.random.ExpSource? frm.random.NormalSource? frm.random.Ziggurat frm.random.Random then frm should be replaced by either tango or tango.math . The main module relevant to the user is frm.random.Random It gives easy to use shared object rand, and a drop-in replacement for tango.math.Random. It has a much richer interface, there are examples in the file random.Random.d file, here is an excerpt

    - shared generator for quick usage available through the "rand" object
      int i=rand.uniformR(10); // a random number from [0;10)
    - simple Random (non threadsafe) and RandomSync (threadsafe) types to 
      create new generators (for heavy use a good idea is one Random object per thread)
    - several distributions can be requested like this
        rand.distributionD!(type)(paramForDistribution)
      the type can often be avoided if the parameters make it clear.
      From it single numbers can be generated with .getRandom(), and variables
      initialized either with call style (var) or with .randomize(var).
      Utility functions to generate numbers directly are also available.
      The choice to put all the distribution in a single object that caches them
      has made (for example) the gamma distribution very easy to implement.
    - sample usage:
      auto r=new Random();
      int i; float f; real rv; real[100] ar0; real[] ar=ar0[];
      // initialize with uniform distribution
      i=r.uniform!(int);
      f=r.uniform!(float);
      rv=r.uniform!(real);
      foreach (ref el;ar)
        el=r.uniform!(real);
      // another way to do all the previous in one go:
      r(i)(f)(rv)(ar);
      // unfortunetely one cannot use directly ar0...
      // unifrom distribution 0..10
      i=r.uniformR(10);
      f=r.uniformR(10.0f);
      rv=r.uniformR(10.0L);
      foreach (ref el;ar)
        el=r.uniformR(10.0L);
      // another way to do all the previous in one go:
      r.uniformRD(10)(i)(f)(r)(ar);
      // uniform numbers in [5;10)
      i=r.uniformR2(5,10);
      // uniform numbers in (5;10)
      f=r.uniformR2(5.0f,10.0f);
      rv=r.uniformR2(5.0L,10.0L);
      foreach (ref el;ar)
        el=r.uniformR2(5.0L,10.0L);
      // another way to do all the previous in one go:
      r.uniformR2D(5.0L,10.0L)(i)(f)(r)(ar);
      // uniform distribution -10..10
      i=r.uniformRSymm(10);
      // well you get it...
      r.uniformRSymmD(10)(i)(f)(r)(ar);
      // any distribution can be stored
      auto r2=r.uniformRSymmD(10);
      // and used later
      r2(ar);
      // complex distributions (normal,exp,gamma) are produced for the requested type
      r.normalSource!(float)()(f);
      // with sigma=2
      r.normalD(2.0f)(f);
      // and can be used also to initialize other types
      r.normalSource!(float)()(r)(ar);
      r.normalD(2.0f)(r)(ar);
      // but this is different from
      r.normalSource!(real)()(i)(r)(ar);
      r.normalD(2.0L)(i)(r)(ar);
      // as the source generates numbers of its type that then are simply casted to
      // the type needed.
      // Uniform distribution (as its creation for different types has no overhead)
      // never cast, so that (for example) bounds exclusion for floats is really
      // guaranteed.
      // For the other distribution using a distribution of different type than
      // the variable should be done with care, as underflow/overflow might ensue.
      //
      // some utility functions are also available
      int i2=r.uniform!(int)();
      int i2=r.randomize(i); // both i and i2 are initialized to the same value
      float f2=r.normalSigma(3.0f);

07/08/08 09:09:45 changed by fawzi

I totally agree that KSS is a serious issue. I already wrote Rng, and I use them, so I know a little the use space, and the current design is the result of some iterations, in which I did remove also features (exactly to avoid KSS syndrome). I am still convinced my design is sound, but design issues might not be apparent from the beginning (especially to me, I need some iterations to get things right). I think that the best way to sort things out is to discuss with others, so I will try to explain my design decisions (sorry in advance for the long post).

* serialization: This might be an issue in games for example, where to be able to reproduce what did happen, an issue that can be avoided by never using seed(), but only manual seeding (maybe even multiple times, before critical sections, because it might be difficult to make the program run exactly the same way if it is interactive). Still if you want to be random you should seed with something random, so you end up doing what seed() does, but in your code, so that you know the seed.

So for gaming (and most other uses of random) serialization is very nice to have (just store the status somewhere in a file, or print it just in case something strange happens, as I do in the tests of Random, and if you need it you can ask the user to send it to you for debugging).

Where serialization become a must if you write something like a poker game, or a serious simulation where you want to sample something with "real" randomness, and periodically reseeding is not an option.

So I think that serialization in the whole is something useful that should be there. Now about the serialization method I thought Reader/Writer was the standard tango way, but I might have been mistaken, please tell me what is the correct way. I wanted to have also a "readable" string approach. What would have been nice to have (and I think technically possible) is to have a Reader/Writer that serializes to xml-like (or similar human readable) format. As this is not (yet?) possible with tango, I wrote a fromString toString.

* synchronization I wanted to have 0 impact of synchronization for people that do not need it (I think that is almost all users of a well designed high random number using applications, that should have a random source per thread that needs it). So the solution is (I think) quite nice: use a synchronized engine for synchronized usage, a non synchronized one if you don't need it. Random and RandomSync? aliases are easy-to-use objects for people that don't care about the engine.

* RandomG is big Well that was not directly said so, but I agree that it might seem so, but this is the result of the following design decisions, and I think the code is quite well segmented (i.e. not too big and unstructured).

- randomization for arrays

This is to have maximum speed for those that want it (and there are those that need millions of random numbers and care about each cycle). I did not optimize each array randomizer, but I wanted the API to be such that one can do such optimizations without changing the it. Actually in the patch to Random that I checked in I had written optimized methods for floats/double/reals (using 32/52 bits generalizing next32/next52), but then I removed them, and with very little overhead wrt. next52 one can do full mantissa initialization, and I also decided to remove next32, because I though that if someone cares so much about speed over accuracy on doubles (not on floats) then he is better off doing it himself by hand getting uints from the source, this usage does not warrant the more complex API (am I right? wrong? well I don't know but it was just to show that I do care about KSS, to the point of scrapping optimized tested code).

- uniform source

Uniformly distributed numbers are heavily used, must be optimized, and some people care about little details, so I thought the warrant a quite comprehensive interface, but I thin still manageable (fullRange,0..1,0..to,from..to,-to..to with various bound inclusions for floats)

- other sources/distributions in the RandomG class

well this is a decision that might stir some controversy, but I think is the correct one. A more modular approach would be to have uniform source on one side, and other built on the top of it, completely separately. The problem is that these sources have a non negligible setup cost (both in time and space), and normal sources for example might be actually quite used around the code. So you might end up with multiple copies in your code. So a wrapper for all sources that caches them is useful, this makes the creation of sources that need several other distributions (like the gamma that needs gaussian and uniform) much more efficient and better.

Still one might have a wrapper object, that contains all the distributions, and the uniform distribution into it. I discarded this approach because the uniform distribution is so commonly used that I did not want an extra indirection to reach it. So basically I merged the wrapper object that contains all the distributions into the uniform distribution.

The code for this is actually quite clean because all the other distributions are defined in other modules, and you have just few utility method setting them up and making them available from RandomG. The cost of these distributions is just an empty pointer in randomG if they are not used. There is an exception: the Gamma distribution, it is just few lines of code and I have put it into RandomG, but it could be moved away to its own module if one wants to be more strict about the module separation.

- two ways to initialize a random number

One has always the possibility to either use a utility method that returns the random number, or a call-like initializing interface. The call-like initializing interface is more general, but it might be "strange" for some users, so I kept both.

- no direct support for pure real rng

there are some Rng that produce directly real numbers, but their quality is inferior, and integrating them with the other in a nice interface is not so easy, so I excluded them

So here are my thoughts, right? wrong? well any design has its tradeoffs, but I think that this is quite balanced, but shoot away about its shortcomings :)

Fawzi

07/08/08 13:36:24 changed by fawzi

Yes in some cases the application can log the seed, but it is for sure more cumbersome than being able to checkpoint your random number generator at any point. For example in a game the state of the Rgn might be difficult to reproduce, because the number of random numbers requested depends on user input. So what do you do if you cannot checkpoint? well you can reseed periodically (ugly but possible). But to reseed if the seed is just one uint then after just 10_000 seeds you have a probability bigger than 1% to pick a duplicate. Acceptable? sometime yes, but not always.

Furthermore if you have a program where you want to produce exactly the same results when run in parallel, sometime it is nice to be able to send the state of the Rng around, so you don't have to do exactly the same operations on all processors. So checkpointing for most uses it is not unavoidable, but it is definitely nice to have, it is critical for a small class of problems where it is important to have real randomness and not loose it through reseeding.

I was thinking to be more tango-like to use the serialization interface to checkpoint a Rng, but if it is such an issue toString-fromString are sufficient, in any case I think that the benefit/cost ratio of the ability to checkpoint (if possible, Urandom for example cannot do it) is definitely favorable to it (just its debugging utility is huge).

(follow-up: ↓ 29 ) 10/15/08 15:23:40 changed by Don Clugston

Yup, blip.random -> tango.math.random. Although, see below.

A few comments: * The code in Random.d has a lot of duplication. That's something we can address later, though. I suspect that if I added a couple more primitives to tango.math.IEEE, the floating point stuff would get much easier. * Is it really necessary for Random to pull in IWritable and IReadable? Why not just expose the Source.InternalState?? (The fact that a random number generator has an internal state seems to be fundamental, not an implementation detail). * Some random libraries I've seen (boost, for example) group statistical distributions together with random functions. I'm not sure that it's a good idea, but I'd like your opinion. * Does the existing MersenneTwister? fit in? * Personally I don't like big heirarchies, and I'd like to minimise the number of modules which users know about. Right now it seems that lots of the code is implementation details; I'd like it to be more obvious which modules should be used.

Ideally a single import would be adequate for the majority of cases. And my guess is, that someone who wants some sophisticated distribution, is going to want others as well. I see the use cases as: 1. 'I just want a random integer for my game' 2. 'I want the standard set of distributions for my monte carlo simulator' 3. (Extremely rare) 'I want to create my own distribution'. (Cryptography is of course completely different).

With 1, we have the existing import tango.math.random.Kiss; For case 2, it would be nice if only 1 or 2 imports were required. It seems a bit clumsy for the user to have to do:

import tango.math.random.engines.KissCmwc?;

unless they are doing something really unusual.

* Any comments you'd like to make on my functions in tango.math would be most welcome. It hasn't been reviewed at all, really, and I don't consider myself an expert on library design. Examples from users saying "this is really clumsy" are very valuable.

(in reply to: ↑ 28 ; follow-up: ↓ 31 ) 10/15/08 21:11:00 changed by fawzi

Replying to Don Clugston:

Yup, blip.random -> tango.math.random.

just a doubt I had could tango.math.random.* conflict with tango.math.Random in case insensitive filesystem (the files could coexist, but dsss& co might get confused... Then tango/math/Random.d would need to go...

Although, see below. A few comments: * The code in Random.d has a lot of duplication. That's something we can address later, though. I suspect that if I added a couple more primitives to tango.math.IEEE, the floating point stuff would get much easier.

Yes the floaing point generation is a little bit messy. The reason is that I wanted to initialize the whole mantissa. Even if one does randomLong/long.max for a float this is not equivalent (but almost) because due to the scientific notation one can represent numbers that are much smaller than 1/long.max. Doing full mantissa initialization if done carefully does not really cost more in 99% of the cases, and has better properties.

* Is it really necessary for Random to pull in IWritable and IReadable?

not really, just a way to save and restore the internal state.

Why not just expose the Source.InternalState?? (The fact that a random number generator has an internal state seems to be fundamental, not an implementation detail).

could be an idea (note that the random device has no "real" internal state, but most have it, and there should be a way to store and recover it.

* Some random libraries I've seen (boost, for example) group statistical distributions together with random functions. I'm not sure that it's a good idea, but I'd like your opinion.

I don't know, I should sleep first :)

* Does the existing MersenneTwister? fit in?

yes one could add it as engine (I was just lazy)

* Personally I don't like big heirarchies, and I'd like to minimise the number of modules which users know about. Right now it seems that lots of the code is implementation details; I'd like it to be more obvious which modules should be used.

actually in 99% of the cases one should use random.Random

rand (alrady initialized, rng threadsafe) Random (non threadsafe) and RandomSync? (threadsafe) types

these object offer uniform,normal,exponential,gamma distributions and are what one should normally use.

If one cares a lot about the performace or statistical property of the source then RandomG!(engine) is the type for a generator using a user defined engine

The default one should be reasonably fast and good, but if one really wants more control he should choose his engine.

Ideally a single import would be adequate for the majority of cases. And my guess is, that someone who wants some sophisticated distribution, is going to want others as well. I see the use cases as: 1. 'I just want a random integer for my game' 2. 'I want the standard set of distributions for my monte carlo simulator' 3. (Extremely rare) 'I want to create my own distribution'.

(Cryptography is of course completely different).

just use the /dev/random engine or an engine you cook up

With 1, we have the existing import tango.math.random.Kiss; For case 2, it would be nice if only 1 or 2 imports were required. It seems a bit clumsy for the user to have to do: import tango.math.random.engines.KissCmwc?;

this is needed only if you want a special engine of that type

unless they are doing something really unusual. * Any comments you'd like to make on my functions in tango.math would be most welcome. It hasn't been reviewed at all, really, and I don't consider myself an expert on library design. Examples from users saying "this is really clumsy" are very valuable.

Thanks for the feedback! I will think about the interfaces in math...

ciao Fawzi

(in reply to: ↑ 29 ) 10/16/08 07:14:13 changed by Don Clugston

Replying to fawzi:

* The code in Random.d has a lot of duplication. That's something we can address later, though. I suspect that if I added a couple more primitives to tango.math.IEEE, the floating point stuff would get much easier.

Yes the floaing point generation is a little bit messy. The reason is that I wanted to initialize the whole mantissa. Even if one does randomLong/long.max for a float this is not equivalent (but almost) because due to the scientific notation one can represent numbers that are much smaller than 1/long.max. Doing full mantissa initialization if done carefully does not really cost more in 99% of the cases, and has better properties.

We definitely don't want a division in there. I was thinking of providing a primitive which pokes the value straight into the mantissa. We could do this since we have IEEE arithmetic.

Actually, I'm not really sure what "a uniform distribution between 0 and 1" means for floating-point. You can easily do a uniform distribution between 1 and 2, and then subtract 1, but after the subtraction it isn't uniform in IEEE space any more.

(follow-up: ↓ 33 ) 11/14/08 08:49:41 changed by fawzi

I was away, and then occupied fighting threading issues, but now I am back. I attach a zip file with the files renamed, without IWriter/IReader interface, and with KeYeR mersenne twister as possible engine.

About the uniform distribution in (0,1). The thing is that you don't want to initialize only the mantissa, but also the exponent. The trick to do it with the right weight is the following: imagine a random number in (0,1) in binary notation. It starts with 0. then has a sequence of 0 and 1 each with probability 1/2. So the idea is to create such a sequence and store it in the float, stopping as soon as you reach the accuracy of the float. Most of the time for a float if you start with 32 bits this is i/2^{32 (where the division can be done at compile time).
If the sequence starts with more than 8 zeros (probability 1/2}8), then you need more bits to fully initialize the mantissa (that has 24 bits, and implicit 1 at the beginning and 23 bits for normal numbers) as the exponent becomes -8 (or even smaller) to accommodate the initial zeros.

This is especially useful for floats, that have a limited mantissa, I read an article about some people testing the tail of gaussian distributions (important for insurances) and artifacts were visible without full mantissa initialization.

99.6% of the time (1-1/2^{8) the code is as fast as the normal way (i/2}32), and in the last cases it is more complex, but not exceedingly slow, so I think that one can use it, is is marginally slower, and has better properties.

(in reply to: ↑ 32 ) 11/14/08 10:53:06 changed by Don Clugston

Replying to fawzi:

I was away, and then occupied fighting threading issues, but now I am back. I attach a zip file with the files renamed, without IWriter/IReader interface, and with KeYeR mersenne twister as possible engine.

For my part, I've been fighting a dying laptop, and then updating Agner Fog's obj converter to support DMD. Finally my new PC is under control. But I won't get a chance to look at your code for at least a week.

About the uniform distribution in (0,1). The thing is that you don't want to initialize only the mantissa, but also the exponent. The trick to do it with the right weight is the following: imagine a random number in (0,1) in binary notation. It starts with 0. then has a sequence of 0 and 1 each with probability 1/2. So the idea is to create such a sequence and store it in the float, stopping as soon as you reach the accuracy of the float.

Cool, I didn't think of that. That's clearly correct.

Ticket Navigation

Ticket #1155 (closed enhancement: fixed)

tango.math.Random improvement

Description

Attachments

Change History

06/19/08 13:50:17 changed by fawzi

06/19/08 15:08:20 changed by fawzi

06/19/08 15:34:09 changed by Don Clugston

06/19/08 16:17:22 changed by fawzi

06/19/08 19:08:17 changed by Don Clugston

06/20/08 09:43:22 changed by larsivi

06/20/08 09:43:33 changed by larsivi

06/20/08 13:21:07 changed by Don Clugston

06/20/08 13:44:32 changed by Don Clugston

07/04/08 15:28:44 changed by fawzi

07/05/08 08:16:49 changed by fawzi

07/06/08 16:45:15 changed by larsivi

07/06/08 17:10:34 changed by mandel

07/06/08 17:13:58 changed by mandel

07/06/08 19:42:18 changed by fawzi

07/07/08 00:27:49 changed by fawzi

07/07/08 00:38:37 changed by fawzi

07/07/08 16:17:35 changed by kris

07/07/08 20:20:31 changed by fawzi

07/08/08 00:55:43 changed by torhu

07/08/08 05:06:49 changed by kris

07/08/08 09:09:45 changed by fawzi

07/08/08 09:50:07 changed by larsivi

07/08/08 13:36:24 changed by fawzi

07/23/08 10:33:27 changed by fawzi

10/06/08 07:08:24 changed by kris

10/13/08 15:49:38 changed by Don Clugston

10/13/08 18:49:45 changed by kris

10/14/08 13:12:51 changed by fawzi

(follow-up: ↓ 29 ) 10/15/08 15:23:40 changed by Don Clugston

(in reply to: ↑ 28 ; follow-up: ↓ 31 ) 10/15/08 21:11:00 changed by fawzi

10/16/08 02:55:14 changed by kris

(in reply to: ↑ 29 ) 10/16/08 07:14:13 changed by Don Clugston

(follow-up: ↓ 33 ) 11/14/08 08:49:41 changed by fawzi

11/14/08 08:51:52 changed by fawzi

(in reply to: ↑ 32 ) 11/14/08 10:53:06 changed by Don Clugston

11/19/08 19:39:16 changed by fawzi

Download in other formats: