HashMap and HashSet Associative Array implementation for D

Features that restrict the implementation

Value reference returned by "in", should be valid after rehash ( but not after a remove). This restricts the implementation to using pointers to AA node entries.

Other features of AA implementations

The principle of Associative Arrays is to convert a key into a hash value as an index for a fast random access lookup into an array.

The table entry index is calculated as hash % table.length.

Table sizes are commonly selected from a table of prime numbers. This reduces index collisions, as compared to using powers of 2 for table sizes.

Implementations commonly define a Node data structure, which encapsulates the key, value association, and most often caches the hash value of the key as well.

The primary table stores Node pointers, and the Nodes are chained in a single linked list.

This implementation allows a loading ratio to be specified, and allows an initial capacity to be set. This turns out not to be much of a performance feature, more of a fine tuning.

So there are C interface functions to set and get capacity, and load ratio.

Node manual clean up.

Calling remove with an AA key deletes the AA node and will invalidate any pointers to the previous held value. There is no way to clean up all the nodes in one go except for calling remove for each and every key value. The initial version 2.043 has a bug in the opApply loop that prevents doing this while doing a foreach.

If all alias references to the AA disappear before all the nodes are deleted, then the runtime relies on the Garbage Collector to delete the pointer table and its nodes.

If some of the keys look like pointers, the GC may be unable to delete the nodes or the table. With a lot of random key values, this will lead to memory allocation and node insertion slowdown each time a Garbage Collection is run. The slowdown gets progressively larger with repeated runs, despite calling the GC to collect between runs.

There is a therefore a C function to support clear operations.

Hash table load and capacity managment

There is a default load factor and capacity management based on the size of the embeded table of prime numbers which approximately are related to each other by a factor of 4. The rehash threshold factor is also a factor of 4.

The function relating resize threshold (capacity) to table length and load ratio is

capacity = ( table length ) * ( load ratio).

The more or less random distribution of hash values results in a distribution of hash chain lengths that become longer with more nodes or bigger load ratio.

After a rehash, the load factor of the table (number of nodes to pointer table length) can be anything between about 0.25 to 1.0. This range is very good for lookup performance, and for most applications the extra memory used for the pointer array is not noticed.

It is possible better control over the loadRatio and capacity using a bigger table of prime numbers.

The gnu/hash.c , has successive prime number values that approximately a factor 2. A better populated prime table allows a finer control of capacity.

During insertions by the DPL builtin hash table, the load ratio oscilates between 0.25 and 4.

Therefore at least have doubled the size of the table of primes to be the same as those found in gnu/hash.c. The grow_rehash algorithm has been adjusted to maintain the same insertion growth.

Enhanced AA for DRuntime

Although the current implementation works very well, I thought the interface and implementation could be enhanced to provide more features without sacrificing current functionality.

I have written an alternative compatible replacement AA for D runtime library with the following features.

It still supports and includes the original C interface functions. Calls to the original functions from unmodified D source code work the same way.

Shared code modules must be using the same Druntime library implementation of course.

As before, all functions in this module use void* pointer for the implementation object reference.

The code was written from the original aaA.d as a starting point, and adds some new C interface functions which share the implementation.

Associative arrays variables using the direct value[key] declaration use the old C interface.

AA Capacity

A preallocation of table node memory can be done, by passing a none zero value in the intial setup method or by setting capacity on the AA when its length is zero.

The preallocation / capacity feature, if used with an HashMap or HashSet with no nodes yet allocated, or used in the setup method, will result in a preallocation of memory for at least the specified number of nodes. For large tables this will result 50-60% faster insertion speed.

If preallocation is done, Node memory management will use an internal instance associated BlockHeap, instead of using the GC to allocate and free memory directly. Most likely the memory will not be returned to the GC until the AA becomes GC unused or its clear method is called. Since the BlockHeap allocates only in large chunks of 4 KB , capacity/preallocation is best used for AA sets which expect thousands of nodes. The value passed to capacity should be the at least a known minimum expected loading. Pre-allocating a maximum value which is rarely reached will waste a lot of memory.

If capacity is not set when length is zero, or preallocation is not used, the AA will continue to directly use the GC.

Regardless of preallocation status, the capacity property cannot be set to be less than the current number of nodes. Assigning a value to capacity may resize the hashtable according to the expected capacity and current value of load ratio.

HashMap

A new template wrapper has been created for builtin HashMap,which uses the current and additional C functions in aaA.d

This looks and behaves similar to the original, but has additional methods and properties.

Some of the new methods supported:

bool putInsert(K key, ref V value) Insert or replace. Return true if the value was inserted.
bool contains(K key) This is fairly obvious (already had "in" operator), but is most useful with the HashSet (below).
bool remove(K key, ref V value) Remove the key, but return the value and if the key existed.
public int eachKey(int delegate(ref K key) dg) Allows trawling through just the keys. The key values cannot be changed. This makes sense for the HashSet, supported by _aaApplyKeys function.
@property void clear Clean up each node. Supported by _aaClear. The reference reverts to being an empty table.
static DRHashMap assumeUnique(ref DRHashMap other) Returns a new HashMap reference with the dataset of the other. The old references point to empty table.
@property DRHashMap dup Copy all the data into a new HashMap reference
void setup(bool hashSmall = false) Configure as a new HashMap reference. Integer/float/char types will not be hashed, unless the hashSmall flag is true. This is a runtime flag that can only be done during setup. The advantage is about 30% increase in performance speed of insertions and lookups, plus a reduction of Node size.

HashSet

As well, to take advantage of the C interface, a template wrapper for a HashSet collection has been created.

This acts like a bool[ keyType ] , where the bool value is virtual for opIndex and opIndexAssign. Only keys are stored in the AA.

Can use OpApply to do foreach across all the keys. Can do remove during foreach, but cannot modify the value of the keys.

The HashSet has no equivalent D runtime type that I know of. It is an AA with the value size set to zero.

This implementation ignores the valuesize passed to it by many of the old C functions, and stores the initial value size in the AA reference.

Most of the new functions no longer pass the TypeInfo for the key and the size of the value.

bool put(K key) Insert if not already exists. Return true if the value was inserted.
bool contains(K key) HashSet has no "in" operator. There is no value type to return.
bool remove(K key) Remove the key,return if the key existed.
public int opApply(int delegate(ref K key) dg) Allows trawling through just the keys. The key values cannot be changed. Supported by _aaApplyKeys function.
@property void clear Clean up each node. Supported by _aaClear. The reference reverts to being an empty table.
static DRHashSet assumeUnique(ref DRHashMap other) Returns a new HashSet reference with the dataset of the other. The old references point to empty table.
@property DRHashSet dup Copy all the data into a new HashSet reference
void setup(bool hashSmall = false) Configure as a new HashSet reference. Integer/float/char key types will not be hashed, unless the hashSmall flag is true. This is a runtime flag that can only be done during setup. The advantage is about 30% increase in performance speed of insertions and lookups, plus a reduction of Node size.

D runtime C function interface

In all of these the AA value type is passed by cast(void*), excepting a few where the address of the AA value type is used. In the new implementation arguments to the original functions that are thought to be superfluous are bracketed with Unused, because the implementation object caches these values on initialization. Some of new C functions do the same job as the old, minus some arguments that do not seem to be needed.

Original interface

size_t _aaLen(AA aa); Return number of AA nodes
void* _aaGet(AA* aa, TypeInfo keyti, size_t valuesize, ...) Last argument is of AA key type. Returns a pointer for storing a value for insertion or replacement. The keyti and valuesize arguments are used if initialisation is required. The new wrapper interface does not use _aaGet.

void* _aaGetRvalue(AA aa, TypeInfo keyti, size_t valuesize, ...) Last argument is of AA key type. Look up for opIn_r. Return null or pointer to existing value.

Unused (keyti, valuesize).

void* _aaIn(AA aa, TypeInfo keyti, ...) This does exactly the same thing as _aaGetRvalue. Unused (keyti)
void _aaDel(AA aa, TypeInfo keyti, ...) Does the remove. Last argument is of AA key type. Unused(keyti)
ArrayRet_t _aaValues(AA aa, size_t keysize, size_t valuesize)
void* _aaRehash(AA* paa, TypeInfo keyti) Unused(keyti). The extra indirection and return type is not necessary now.
ArrayRet_t _aaKeys(AA aa, size_t keysize) Unused(keysize)
ArrayRet_t _aaValues(AA aa, size_t keysize, size_t valuesize) Unused(keysize, valuesize)
int _aaApply(AA aa, size_t keysize, dg_t dg) Unused(keysize) extern (D) typedef int delegate(void *) dg_t;
int _aaApply2(AA aa, size_t keysize, dg2_t dg) Unused(keysize) extern (D) typedef int delegate(void *, void *) dg2_t;
BB* _d_assocarrayliteralT(TypeInfo_AssociativeArray ti, size_t length, ...) The ti arguments is used for initialisation. The varargs are the literal AA entries to be assigned. Returns a new AA implementation.
int _aaEqual(TypeInfo_AssociativeArray ti, AA e1, AA e2) Returns true only if both AA are the same object, or have the same length and same set of key, value entries.

Added Interface functions

bool _aaDelNode(void*,...) Same as _aaDelNode, minus keyti
bool _aaDelGetValue(void*,...) key, value* 2 Extra arguments are a Key type, and a pointer to the Value type. Supports bool AA.remove(key, ref Value), ie Simultaneous get and remove.
void* _aaGetNodeValue(void*, ...) key Extra argument is Key type. Same as _aaIn, _aaGetRValue. Returns a pointer to Value, or null.
void _aaClear(void*) Equivalent to calling remove for every key.
size_t _aaGetCapacity(void*) Return current rehash threshold
void _aaSetCapacity(void*, size_t cap) This will resize the table according to the current load ratio. Capacity is a number of nodes, and cannot be set to less than the current number of nodes.
If the number of nodes is zero, the AA will switch to using a specialized heap for node allocation and pre-allocate the nodes.
void _aaSetLoadRatio(void*, double ratio) Sets the load ratio value used by the capacity calculation.
double _aaGetLoadRatio(void*) Read the current load ratio.
void* _aaPutInsert(void*, ...) key Equivalent to _aaGet, except prior initialisation is assumed and not checked. For insert or replace value. Extra argument is of Key type.
bool _aaContains(void*, ...) key Return if the extra argument of key type exists in the AA.
int _aaApplyOne(void*, dg_t dg) Equivalent to _aaApply
int _aaApplyTwo(void*, dg2_t dg) Equivalent to _aaApply2
bool _aaDirectHashType(TypeInfo keyti) Actually not an AA function, but classifies the key type as being storable directly in the hash field.
ArrayRet_t _aaStats(void*) [buckets, empty count, #(1), #(2) ...] Return uint[] array, of at least length 2, describing the bucket distribution of chain lengths
ArrayRet_t _aaGetKeys(void*) Equivalent to _aaKeys
ArrayRet_t _aaGetValues(void*, TypeInfo valti); Equivalent to _aaValues. The second argument is necessary because no guarantee of internal initialisation with TypeInfo_AssociativeArray, because old interface still supported.
size_t _aaGetKeyValues(void*, ArrayRet_t*, ArrayRet_t* , TypeInfo valti) Get arrays of both keys and values in same order at same time. This maybe superfluous.
void _aaResize(void*) Equivalent to the _aaRehash, with no indirection in argument.
void* _aaAssumeUnique(void*); Returns a new implementation object holding all the data. Old object and aliases become empty. Similar usage to std.contracts.assumeUnique, only for wrapper.
void* _aaInitAA(void*, TypeInfo_AssociativeArray, bool hashSmall, uint preAllocate) Initialise the array. Init checking is done by the wrapper. Setting the hashSmall argument to true prevents the DirectHashType field key storage. Optionally preallocate node memory and use a specialised node heap.
void* _aaInitCopySetup(void**, void* ) Initialise the array with settings from another array. This is again only internal use in the wrapper.
void* _aaInitHashSet(void*, TypeInfo keyti, bool hashSmall, uint preAllocate);

Initialise the array as a key only set. Optionally preallocate node memory and use a specialised node heap.
int _aaApplyKeys(void*, dg_t dg); Iterate through the keys with the delegate type. The delegate is given a pointer to a copy of each key.

Benchmark program

Example of command line usage: aatest-r.exe n=30 m=1101008 -i hashlist -t uu -l 4

n=<number of runs> ; Can also override number of runs in data file

m=<number of keys>

-g <data file path> ; generate a data file, must also include n, m and -t options

-d <data file path> ; read a data file, uses the stored n, m and -t options

-l load ratio, use of depends on implementation, value between 0.5 and 16.0

-t <key value type>

uu uint[uint]
su uint[string]
cs uint[WrapString]
cu uint[WrapUint]
us string[uint]

-i implementation

hashtree Template version of the builtin AA, with more features. Array of node pointers
hashlist Single linked list version, Array of node pointers
hashptr Open addressing version, linear congruential sequence. Array of node pointers
hashtablist Direct Node storage plus single link chaining. Array of node records
pydict Direct Node storage and open addressing. Array of node records
builtin Call the druntime version
builtin-hm Call the druntime version using DRHashMap

Please do your own testing.

The benchmark source main module is http://www.dsource.org/projects/aa/browser/trunk/testhash.d

Installation (and removal)

Replace the aaA.d in dmd2\src\druntime\src\rt with the trunk\druntime\aaA.d.

Run the makefile in dmd2\src\druntime. On windows a prebuilt minit.obj will be required.

Edit the win32.mak, and comment out the minit.asm line with a '#'.

Also rebuild phobos using dmd2\src\phobos\win32.mak.

My little batch file, run from C:\D\dmd2\src, assuming PATH setup for D2, is :-

set MYDIR=%CD%

cd %MYDIR%\druntime

del lib\druntime.lib

make -f win32.mak

cd %MYDIR%\phobos

make -f win32.mak

copy /Y phobos.lib ..\..\windows\lib\

cd %MYDIR%

Recompile the test application, using the hash.druntime wrappers.