This library has no dependencies other than the latest versions of Phobos and DMD. To build,
simply unpack all the files into an empty directory and run:

dmd -O -inline -release -lib -ofdstats.lib *.d

You can also combine dstats with other libraries, etc. as you see fit. I intend to keep the
build process trivial for the foreseeable future, so that dstats is as easy as possible to set
up and the barrier to entry is as low as possible.

Conventions of this library:

1. A delicate balance between ease of use, flexibility and performance should be maintained.
There are tons of good libraries for hardcore numerics programmers that emphasize performance above
all else. There are also tons of good statistics packages for people who are basically
non-programmers and aren't doing large-scale analyses or analyses in the context of larger programs.
The distribution seems very bimodal. This library tries to target the middle ground and recognize
the principles of tradeoffs and diminishing returns with regard to performance, flexibility
and ease of use.

2. Heap allocations should be minimized. Whenever temporary space needs to be allocated internally,
the call stack or TempAlloc is used if possible. This allows good multithreaded performance, which
matters, for example, when computing large correlation matrices or performing statistical tests
on every exon in the human genome.

3. Everything should work with the lowest common denominator generic range possible. It's
frustrating to have to write tons of boilerplate code just to translate data from one format into
another. Also, even if the data is in the form of an array, it often needs to be copied so it
can be reordered without the reordering being visible to the caller. In these cases, it can be
copied just as easily whether the input data is in the form of an array or some other range.

4. Throwing exceptions vs. returning NaN: The convention here is that an exception should be
thrown if a primitive parameter (e.g. an int or a float) is not in the acceptable range. This is
because such things can trivially be checked upfront and should not occur by accident in most cases,
except for the case of bugs internal to dstats. If the errant function parameter is the dataset,
i.e. a range of some kind, then a NaN should be returned, because when doing large-scale analyses,
a few pieces of data are expected to be defective in ways that are not easy to check upfront and
should not halt the whole analysis.

In general, this means that dstats.distrib and dstats.gamma should throw on invalid parameters,
and all other modules should return a NaN. Any other result is most likely a bug.
Cases where dstats.tests calls into dstats.distrib, resulting in thrown exceptions, are
unfortunately too common and need to be fixed.

5. License: Each file contains a license header. All modules that are written exclusively by
the main author (David Simcha) are licensed under the Boost license, so that pieces of them may
freely be incorporated into Phobos and attribution is not required for binaries. Some modules
consist of code borrowed from other places and are thus required to conform to the terms of those
licenses. All are under permissive (i.e. non-copyleft) open source licenses, but some may require
binary attribution.
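
Conventions 3 and 4 can be illustrated with a short sketch. The code below is not taken from
dstats: the names bernoulliVariance and sampleStdev are hypothetical and exist only for this
example, and nothing beyond Phobos is assumed. It shows a function that throws on an out-of-range
primitive parameter, and a function that accepts any input range and returns NaN when the dataset
itself is unusable.

```d
import std.exception : enforce;
import std.math : isNaN, sqrt;
import std.range.primitives : ElementType, isInputRange;

// Hypothetical function, not part of dstats. A primitive parameter outside
// its valid range throws (convention 4). A NaN argument also fails the
// check, because every comparison with NaN is false.
double bernoulliVariance(double p)
{
    enforce(p >= 0 && p <= 1, "p must be in [0, 1]");
    return p * (1 - p);
}

// Hypothetical function, not part of dstats. It accepts any input range
// whose elements convert to double (convention 3) and returns NaN for a
// dataset that is too small, instead of throwing (convention 4).
double sampleStdev(R)(R data)
if (isInputRange!R && is(ElementType!R : double))
{
    // Welford's one-pass algorithm: needs no heap allocation and no
    // random access, so a plain input range is enough.
    size_t n = 0;
    double mean = 0, m2 = 0;
    foreach (x; data)
    {
        ++n;
        immutable delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }
    if (n < 2)
        return double.nan; // defective dataset: signal with NaN
    return sqrt(m2 / (n - 1));
}

unittest
{
    import std.exception : assertThrown;
    import std.math : abs;
    import std.range : iota;

    // A lazy range works without first being converted to an array.
    assert(abs(sampleStdev(iota(1, 4)) - 1.0) < 1e-12);

    // Too few data points: NaN, so a large batch analysis keeps going.
    assert(isNaN(sampleStdev([42.0])));

    // Invalid primitive parameter: exception.
    assertThrown(bernoulliVariance(1.5));
}
```

With this split, a caller looping over thousands of datasets only has to filter NaN results
afterwards, while a mistyped scalar parameter fails loudly at the first call.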