Hashing
Tom Kelliher, CS23
May 8, 1996
The general idea of hashing:
- Data is categorized by a search key.
- Data is kept in a table (open or closed).
- Number of search keys
>>
table size.
- Amount of active data
<
table size.
- Hash function evenly distributes data over table.
- Insert, retrieve, delete require O(1) time.
- Ordered operations usually slow.
A hash function, h(), takes an arbitrary integer, j, and maps into an
integer in a specified range, usually 0 to p-1 where p is prime.
h() should have these properties:
- Easy and fast to compute.
- Place data evenly throughout the table.
- How well does h() scatter random keys?
- How well does h() scatter non-random key?
- Perfect hash function --- possible if you know all keys in advance.
- Otherwise, we have collisions. How can we prove this?
- Collision resolution.
Sampler of h()'s
( Always possible to convert a key to an integer/integers):
- Selecting digits --- which digits?
- Folding --- add digits.
- Modulo Arithmetic.
- Hash table is an array of data holders.
- h() gives first possible position for key (no collision).
- Probe sequence used on collisions.
- Table element states:
- Never used.
- In use.
- Previously used (``plug'' holes in probe sequence left by
deletes).
- Clustering.
Some probe sequences:
- Linear probing.
- Quadratic probing.
- Double hashing --- initial probe position, increment. Table size,
increment should be relatively prime.
(Separate chaining.)
Hash table is array of pointers (array of linked lists.)
- Hash tables have excellent performance on average.
- Use a balanced search tree when performance must be guaranteed.
- Time wise, separate chaining superior to linear probing, quadratic
probing, or double hashing.
- Pointer overhead involved with chaining.
Hashing randomizes the keys.
Poor performance for retrieving:
- keys in sorted order.
- min, max key.
- keys over a range.
Thomas P. Kelliher
Tue May 7 13:13:00 EDT 1996
Tom Kelliher