Can we do better than logarithmic performance for sets and dictionaries? Can we...possibly...get to constant time?
We can’t do better than logarithmic if we have to compare values against each other. So, in much the same way we “improved” upon the comparison-based sorts by using distribution (bucket / radix) techniques, we can apply a similar idea to search.
You define a hash function $h$ that maps your keys into integers in the range $0 ... N-1$. Each key $k$ then goes into bucket $h(k)$.
Let:

- $N = 11$
- $h(s) = (\text{code of first character of } s \times \text{code of last character of } s) \bmod N$

where "code" means the character's integer (ASCII) code, and $h$ maps a String to a slot in $0 \ldots N-1$.
Then if we put in the values {"hashtables", "will", "usually", "execute", "constant", "time"}
the slots get computed like this:
```
h("hashtables") = ('h' * 's') % 11 = 3
h("will")       = ('w' * 'l') % 11 = 4
h("usually")    = ('u' * 'y') % 11 = 0
h("execute")    = ('e' * 'e') % 11 = 4
h("constant")   = ('c' * 't') % 11 = 0
h("time")       = ('t' * 'e') % 11 = 1
```
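A quick sketch of this hash function in Python, using `ord` for the character codes, which reproduces the slots above:

```python
def h(s):
    # Multiply the character codes of the first and last characters,
    # then reduce modulo the table size (11).
    return (ord(s[0]) * ord(s[-1])) % 11

for w in ["hashtables", "will", "usually", "execute", "constant", "time"]:
    print(w, "->", h(w))
```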
Note that in general you get collisions, which need to be resolved. There are many collision resolution strategies; one of the most common is just to chain the colliding elements in a linked list. If we did that, our hashtable would look like this:
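A minimal sketch of a chained hashtable in Python, using the same first-char-times-last-char hash from the example (the class and method names here are illustrative, not from any library, and plain Python lists stand in for the linked-list chains):

```python
class HashTable:
    """Toy set-of-strings hashtable with separate chaining."""

    def __init__(self, size=11):
        self.buckets = [[] for _ in range(size)]

    def _hash(self, key):
        # Same toy hash as in the text: first char code * last char code, mod size.
        return (ord(key[0]) * ord(key[-1])) % len(self.buckets)

    def put(self, key):
        chain = self.buckets[self._hash(key)]
        if key not in chain:          # a set holds each key at most once
            chain.append(key)

    def contains(self, key):
        # Worst case we scan the whole chain, so performance depends on
        # how evenly the hash function spreads the keys.
        return key in self.buckets[self._hash(key)]

table = HashTable()
for w in ["hashtables", "will", "usually", "execute", "constant", "time"]:
    table.put(w)
```

With these six words, slots 0 and 4 each end up with a two-element chain, matching the collisions computed above.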
If every element hashed to the same slot, the hashtable would have linear, not constant, performance. So you need to be smart about designing hash functions that spread the keys evenly across the slots.
This is hard in general, and is application dependent. It’s also beyond the scope of these notes. Start with Wikipedia’s Hashtable article and then do your own research.
Hashtables perform pretty well until they get about 70% full (i.e., a load factor of 0.7). After that you generally want to make the table bigger and rehash everything. This takes time, but is worth it when you do more lookups than inserts.
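A sketch of load-factor-triggered growth in Python. Python's built-in `hash` stands in for a real hash function here, and the 0.7 threshold and doubling policy are the usual conventions, not requirements:

```python
class ResizingTable:
    """Toy chained hashtable that grows and rehashes past load factor 0.7."""

    def __init__(self, size=11):
        self.buckets = [[] for _ in range(size)]
        self.count = 0

    def put(self, key):
        chain = self.buckets[hash(key) % len(self.buckets)]
        if key in chain:
            return
        chain.append(key)
        self.count += 1
        if self.count / len(self.buckets) > 0.7:   # load factor check
            self._rehash(2 * len(self.buckets))    # double the table

    def _rehash(self, new_size):
        old = self.buckets
        self.buckets = [[] for _ in range(new_size)]
        for chain in old:
            for key in chain:
                # Every key must be re-placed: its slot depends on the new size.
                self.buckets[hash(key) % new_size].append(key)
```

Note that rehashing visits every stored key, which is why growth is done rarely (and by doubling), so its cost amortizes across many cheap inserts.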
Ditto for deletions: you might want to shrink the hashtable if it is taking up too much memory. Many resizing strategies are known; find details in Wikipedia's Hashtable article.
Hashtables are good because:

- lookups, insertions, and deletions all run in constant time on average
- they are simple to implement and ubiquitous in standard libraries

Hashtables aren't always the best choice because:

- worst-case performance is linear when many keys collide
- they don't keep keys in order, so range queries and sorted iteration call for a tree instead
- they trade space for speed: slots sit empty, and resizing temporarily needs extra memory
There’s so much more in Wikipedia’s Hashtable article.
We've covered:

- hash functions and how they map keys to slots
- collisions and resolving them by chaining
- load factors, growing, and shrinking