C# Hidden Danger #2: GetHashCode()
Suppose you have an address book saved as a CSV text file or an XML file. You would like to create a program that loads the address book into memory, and return the contact details of a person, given the name.
There are a number of data structures from which to choose: array, ArrayList, List<T>, Hashtable, etc. If there are only a few hundred entries in your address book, then you could choose any data structure that suits your fancy. But if the number of entries range in the hundreds of thousands, then you need to choose the best data structure and algorithm for fast retrieval. You certainly don’t want to create a foreach loop to search through your list.
(Side note: List<T> has the Find(Predicate<T>) method, so you don’t need to create an explicit foreach loop, but behind the scenes, your predicate is being called multiple times. To verify this, you can put a breakpoint on your predicate method in Visual Studio. My friend DiRN calls this “looping without looping”).
Which is why hash tables were invented. Hash tables take advantage of the fact that array lookup is very fast, if you supply the index: by using the square bracket notation, the program would be able to go straight to the given memory location.
Going back to the address book example, if we have a function to transform the key (which is the name of the person) into an integer, then we would be able to use an array as our data structure. To store and retrieve the contact details of John Smith, we would simply apply this function to the string “John Smith”, and use the integer returned by our function as the index of the array. In computer science, this type of function is called a hash function.
The problem with hash functions is hash collision, where the function returns the same integer for two or more different inputs. What do we do if our function above returns the same index for both John Smith and Ramasamy Chandran? There are two solutions to this: chaining and open addressing. By chaining, each element in the array is a linked list, so that a single array location can contain more than one piece of data. Open addressing involves using alternate locations in the array. All in all, it’s quite messy.
Fortunately the Hashtable class does all the work for us. Straight out of the box, Hashtable is very simple to use:
using System.Collections; // insert items into the Hashtable: Hashtable ht = new Hashtable; ht.Add(key1, obj1); ht.Add(key2, obj2); // etc. // retrieval: object obj42 = ht[key42];
Here’s the catch: for best performance, you need to override the default implementation of GetHashCode(), i.e., the hash function. The default implementation of GetHashCode() for reference types is correct but not necessarily efficient; for value types, it is not necessarily efficient and might even be incorrect. (Even then, overriding Object.operator==() can break Object.GetHashCode(). ValueType.GetHashCode() is correct only for structs whose first field is read-only.)
So, what’s the best hash function? Well, if one were to exist, surely it would already be included in the default implementation of GetHashCode(). Constructing a correct and efficient hash function requires extensive knowledge of the type. The root object does not have this advantage, which is why the default implementation is naive and simplistic.
In conclusion, even without overriding the GetHashCode() method of your key object, the Hashtable gives you the fastest retrieval [1], as long as the following conditions are met:
1. If your key is a reference type, ensure that you did not override the == operator.
2. If your key is a value type, ensure that the first field in your struct is read-only.
Otherwise you need to override the default implementation of GetHashCode().
Notes
[1] “The Hashtable and any related blog entries is provided ‘as is’ without warranty of any kind, either express or implied, including, without limitation, the implied warranties or merchantability, or fitness for a particular purpose. The entire risk arising out of use or performance of the Hashtable remains with you.”
One Response to “C# Hidden Danger #2: GetHashCode()”
1 zeff 26 August 2008 @ 12:03 pm
In JavaScript and ActionScript world, it’s called Associative Array
Comments: