The Developer's Library for D

Ticket #795 (closed enhancement: fixed)

Opened 12 years ago

Last modified 11 years ago

toHash and getHash functions provided in TypeInfo are of poor quality

Reported by: Deewiant Assigned to: fawzi
Priority: normal Milestone: 0.99.8
Component: Core Functionality Version: trunk
Keywords: Cc:

Description

The hash functions which the TypeInfos provide aren't particularly good. I quote from TypeInfo_Struct: "A sorry hash algorithm."

I'm not terribly familiar with the subject, but the array one, for instance, just adds up the hashes of its elements. This means that "abc" and "bca" and "cba" hash to exactly the same thing, unless I'm mistaken. Surely we can do better.

Some links on the subject:

http://www.azillionmonkeys.com/qed/hash.html
http://burtleburtle.net/bob/hash/

Telling quote from google-sparsehash: "As an example, in one test of the default SGI STL string hash function against the Hsieh hash function, for a particular set of string keys, the Hsieh function resulted in hashtable lookups that were 20 times as fast as the STLPort hash function."

I haven't tested STLPort, but I wouldn't be surprised if it were to outperform Phobos's/Tango's current hashes.

Change History

11/29/07 19:52:19 changed by kris

"good" is somewhat in the eye of the beholder. Good dispersal, good speed, or a good balance? Walter went for good speed, I suspect. But yeah, the algorithm applied upon text arrays is quite pitiful, and not much faster than a tweaked JHash.

We've been meaning to replace that with the one used elsewhere in Tango (you can find it in tango.text.Util). I've always found Bob Jenkins' approach to provide a good balance, and the Paul Hsieh variation is not bad either. We did quite a bit of testing in the past, with various algorithms.

02/23/08 05:22:44 changed by kris

  • milestone set to 0.99.6.

04/22/08 00:27:21 changed by sean

  • milestone changed from 0.99.6 to 1.0.

This is a non-critical enhancement request so I'm pushing it off to 1.0 pending further prioritization.

07/26/08 21:27:48 changed by sean

  • status changed from new to assigned.

I'd originally avoided changing this because it was a non-essential diff from the Phobos runtime. However, I've come to feel that Walter will never take over maintenance of the Tango DMD runtime anyway, so there's little point in preserving old functionality in hopes that this will increase the likelihood that he will do so. I'll fix this before 1.0.

11/19/08 12:34:58 changed by larsivi

  • owner changed from sean to fawzi.
  • status changed from assigned to new.

11/21/08 10:58:22 changed by fawzi

One often has to build hashes of composite objects. If the order of the elements is arbitrary and not fixed (e.g. the keys in a hash table), then summing the hashes is a good solution. When changing the order changes the object (a string, for example), one should chain the hashes with a non-commutative operation.

1) Normally one can chain hashes at basically no cost, so one could add an optional chaining parameter (defaulting to 0) to getHash and toHash. This is faster when the keys are not already hashes (which I think is the common case).

2) A more expensive and slightly more cumbersome solution is to use the one-at-a-time hash (split into combine and finish operations) to combine hashes when non-commutativity is wanted. This is slightly faster (it spares an if) when the keys are typically already hashes.
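The split described in option 2 can be sketched in C with Bob Jenkins' one-at-a-time hash; the combine/finish function names are illustrative, not an actual Tango API:

```c
#include <stdint.h>
#include <stddef.h>

/* Jenkins' one-at-a-time hash, split into a per-element combine step
   and a final avalanche (finish) step, as option 2 suggests. */
static uint32_t oat_combine(uint32_t h, uint32_t elemHash)
{
    h += elemHash;
    h += h << 10;
    h ^= h >> 6;
    return h;
}

static uint32_t oat_finish(uint32_t h)
{
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}

/* Combining element hashes this way is order-sensitive, so "abc" and
   "cba" no longer collide. */
static uint32_t oat_hash(const unsigned char *p, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++)
        h = oat_combine(h, p[i]);
    return oat_finish(h);
}
```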

As we are changing something anyway, I think it is worthwhile to change it well and switch the hash type from int to size_t. This has some consequences, and needs some changes in the pieces that use the hash on 64-bit platforms, but I think it is the right choice.

I have looked at http://www.burtleburtle.net/bob/hash/doobs.html, which gives a nice overview of all the hashing options. The idea would be to port lookup3.c and lookup8.c (32/64 bits) into the runtime as infrastructure that can be used to build hashes, and to use it for arrays (and thus strings) and associative arrays (which are another ticket). If solution 2 is chosen, then maybe one-at-a-time should be there as well. This might also be useful to others.

Moving it into the runtime would make the version in tango.text.Util redundant.

11/21/08 13:52:02 changed by Deewiant

http://murmurhash.googlepages.com/ looks pretty cool too—twice as fast as lookup3, better distributed, and less biased. (For pretty pictures which tell part of the story, see http://murmurhash.googlepages.com/avalanche).
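For reference, a C sketch of Austin Appleby's MurmurHash2, transcribed from memory of the published reference code (the memcpy word load is a portability tweak over the original's direct cast, and the exact output values depend on byte order):

```c
#include <stdint.h>
#include <string.h>

/* MurmurHash2: processes the key one 32-bit word at a time, mixing each
   word with multiply/shift/xor steps, then avalanches the result. */
uint32_t murmur2(const void *key, int len, uint32_t seed)
{
    const uint32_t m = 0x5bd1e995;
    const int r = 24;
    uint32_t h = seed ^ (uint32_t)len;
    const unsigned char *data = (const unsigned char *)key;

    while (len >= 4) {
        uint32_t k;
        memcpy(&k, data, 4);    /* portable unaligned word read */
        k *= m;
        k ^= k >> r;
        k *= m;
        h *= m;
        h ^= k;
        data += 4;
        len -= 4;
    }

    /* Mix in the remaining 1-3 tail bytes. */
    switch (len) {
    case 3: h ^= (uint32_t)data[2] << 16; /* fall through */
    case 2: h ^= (uint32_t)data[1] << 8;  /* fall through */
    case 1: h ^= (uint32_t)data[0];
            h *= m;
    }

    /* Final avalanche. */
    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;
    return h;
}
```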

I think hashing of composite objects should just be defined separately for each type. As you say, for hash tables, summing isn't necessarily a bad solution. Still, I think the option of a one-at-a-time hash should be there somewhere (unless we can show that it's never needed; I doubt we can).

I agree that hash_t should definitely be size_t.

11/21/08 14:50:58 changed by fawzi

Yes, MurmurHash looks nice, so at least for 32 bits we want that.

About composite objects: yes, each type should do it itself, but it can be more or less easy to implement. Imagine an object containing a, b, c.

solution 1:

hash_t getHash(hash_t s){
  s=a.getHash(s);
  s=b.getHash(s);
  s=c.getHash(s);
  return s;
}

solution 2:

hash_t getHash(hash_t s){
  hash_t r=oneAtTimeMix(s,a.getHash());
  r=oneAtTimeMix(r,b.getHash());
  r=oneAtTimeMix(r,c.getHash());
  return oneAtTimeFinish(r);
}

Solution 2 is slightly slower and more cumbersome, but with solution 1, implementing getHash for something that is already a hash looks like this:

hash_t getHash(hash_t s){
  if (s==0){
    return myValue;
  } else {
    return oneAtTimeHash(s,myValue);
  }
}

which is slightly slower (but I would say that this is not the common case).

11/22/08 04:45:52 changed by kris

MurmurHash does look nice

03/10/09 12:08:30 changed by fawzi

  • status changed from new to closed.
  • resolution set to fixed.

(In [4400]) shared typeinfo and utils between compilers. better hashing closes #795 , refs #988

03/10/09 19:36:16 changed by larsivi

  • milestone changed from 1.0 to 0.99.8.