Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Hash??? Not quite clear on what this is...

315 views
Skip to first unread message

Charles Tryon

unread,
Dec 3, 1990, 9:46:12 AM12/3/90
to inews
in <8...@compnect.UUCP>, john...@compnect.UUCP (John Core ) writes:
> > av...@netcom.UUCP (Avery Colter) writes:
> >
> > >I'm seeing all this talk of "Hash Functions".
> >
> College programming courses seem to never deal with the REAL (ie buisness)
> world. ...
>
> --just another old fart telling a war story...

Well, I never learned much about hashing when I went to school (more years
ago than I want to think about ;-). I know the concept of hashing, but have
never had a use for it in the applications I have worked on. My question
is, are there any recomendations out there for a good, fairly basic book on
hashing? I don't need to know all the gory details and the latest hot-shot
methods, just the basics.

--
Chuck Tryon
<bi...@bisco.kodak.com>
USmail: 46 Post Ave.;Roch. NY 14619 B. Baggins
<<...include standard disclamer...>> At Your Service

"Swagger knows no upper bound, but the laws of physics remain unimpressed."
(D. Mocsny)

Dan Bernstein

unread,
Dec 3, 1990, 4:54:43 PM12/3/90
to
In article <901203144...@bisco.kodak.COM> bi...@bisco.kodak.COM (Charles Tryon) writes:
> My question
> is, are there any recomendations out there for a good, fairly basic book on
> hashing? I don't need to know all the gory details and the latest hot-shot
> methods, just the basics.

Knuth spends a section on hashing. The presentation is good and has all
the basics, but doesn't go into the latest hot-shot methods.

---Dan

Tim Lynch

unread,
Dec 4, 1990, 9:20:17 AM12/4/90
to

Thanks for the reference. Can someone recommend a book that does go into
all the latest "hot-shot" methods?

Tim Lynch

Dan Salomon

unread,
Dec 4, 1990, 9:48:09 PM12/4/90
to
In article <901203144...@bisco.kodak.COM> bi...@bisco.kodak.COM (Charles Tryon) writes:
> ... My question

> is, are there any recomendations out there for a good, fairly basic book on
> hashing? I don't need to know all the gory details and the latest hot-shot
> methods, just the basics.

Almost any university text on data structures will have a section on hash
searching. See for instance:

"Fundamentals of Data Structures" by Horowitz and Sahni
"Data Structures Techniques" by Standish
"Data Structures" by Reingold & Hansen

The first is the most readable of the three, and the last is the most
complete. Hashing can be used on almost any computerized search
problem, not just string searching problems, and usually performs very
well. I have even used hashing to search for matching states in a
parsing automaton. But it may take some imagination and skill to apply
it to your specific problem.
--

Dan Salomon -- sal...@ccu.UManitoba.CA
Dept. of Computer Science / University of Manitoba
Winnipeg, Manitoba, Canada R3T 2N2 / (204) 275-6682

Dan Bernstein

unread,
Dec 4, 1990, 7:37:28 PM12/4/90
to
In article <1990Dec4.1...@batcomputer.tn.cornell.edu> ly...@batcomputer.tn.cornell.edu (Tim Lynch) writes:
> Thanks for the reference. Can someone recommend a book that does go into
> all the latest "hot-shot" methods?

Have there been any fundamentally new hashing methods in the last twenty
years? The only one I know of is Pearson's.

What are you really looking for? There are lots of string functions that
are both fast and, in practice, better than random hashing. These days I
start from h = 5381, and set h = h << 5 + h + c mod any power of 2 for
each new character c. Apparently Chris Torek prefers a multiplier of 31:
h << 5 - h + c. These are reliable and extremely fast. You can make them
even faster if you want to hash an entire string at once: just
precompute powers of 31 or 33, etc.

---Dan

Colin Plumb

unread,
Dec 15, 1990, 7:14:05 PM12/15/90
to
In article <901203144...@bisco.kodak.COM> bi...@bisco.kodak.COM (Charles Tryon) writes:
> Well, I never learned much about hashing when I went to school (more years
> ago than I want to think about ;-). I know the concept of hashing, but have
> never had a use for it in the applications I have worked on. My question
> is, are there any recomendations out there for a good, fairly basic book on
> hashing? I don't need to know all the gory details and the latest hot-shot
> methods, just the basics.

Hashing: taking a long key (often a string) and returning a short key (often
an integer in some range) that tries to be different for every long key.
This is obviously impossible in general, and you get "collisions" where
different long keys map to identical short keys.

For a good hash function, collisions are no more likely than if you used random
numbers. If you know all the input keys ahead of time, you can work out a hash
function that has no collisions. If every short key is used (the N possible
long keys hash to 0..N-1), you have a "perfect hash."

A common application of hash functions is hash tables, where you want to store
some records indexed by long key and get access to them quickly. So you keep
an array of records and index it by the hash function to find the record you
want. Handling collisions can be done in various ways. I like "chaining"
where you make each entry a linked list and do linear search along them.
There's also extensible hashing, where you copy everything to a bigger table
and arrange things so there are no collisions. Then there's linear probing,
useful for tables that never need deletion. You search forwards from the
hash position until you find an empty slot. If you're writing, put it here.
If you're reading, this means that the value in question has never been
written. Simple, but its performance gets Very Bad as the table fills up.

Then there's double hashing, where you search forwards in steps of a second
hashing function, and otherwise works the same (this works very well), uniform
hashing (where the hash function produces a permutation of the table buckets
to search in - double hashing is a very close approximation), and other tricks.

More unusual applications involve traditional indexing techniques (you don't
fill up the space as much), but you store a hash of the key because it's more
compact and quicker to compare. Diff uses this to compare lines... it hashes
a line of text and compares them first, then double-checks with a full text
compare if the lines are equal.

Another interesting variant was used in a version of spell(1), although I don't
know if it's the current one or not. The dictionary of correctly spelled words
was hashed into a large hash space (27 bits?) and the result stored as a
bitmap. No, not as 2^24 = 16 Megabytes, but the gaps between adjacent bits,
with some higher-level indices. that let you check if bit #4392859 is set
fairly quickly. This has the possibility of missing a misspelled word, but
it's very rare and a "stop list" of words it fails to catch was added to
deal with exceptions that actually were noticed in practice.

For integers to integers, input%outputrange works fairly well if you arrange
for outputrange to be prime. For strings, hash = fiddle(hash)+*p++ is popular.
Someone just posted fiddle(x) = x*31 and x*33. (You range reduce mod 2^n
for suitable n eventually.)

One that was published in CACM recently was to set up a table with a random
permutation of 0..255 and loop over hash = table[hash^*p++];. If you set up
the table more methodically, you get an 8-bit CRC of the input string, but
any random permutation works well.

Knuth (section 6.4) mentions multiplicative hashing (take some of the middle
bits of constant*input).

That was quick, but that's the guts of the thing.

0 new messages