Hashing
Everything you need to know about probability
Linearity of expectation
Indicator variables
Independent events
Product rule
Markov inequality
Hashing
Dictionaries
Operations.
makeset, insert, delete, find
Model
keys are integers in \(M=\set{1,\ldots,m}\)
(so assume machine word size, or “unit time,” is \(\log m\))
can store in array of size \(M\)
using power: arithmetic, indirect addressing
(more than bucket based heaps that use indirection without arithmetic)
compare to comparison and pointer based sorting, binary trees
problem: space.
Hashing:
find function \(h\) mapping \(M\) into table of size \(s \ll m\)
use it to put \(n\) items from \(M\) in the table
so presumably \(s \ge n\)
Still, some items can get mapped to same place: “collision”
use linked list etc.
search, insert cost equals size of linked list
goal: keep linked lists small: few collisions
Hash families:
problem: for any hash function, some bad input (if space \(s\), then \(m/s\) items to same bucket)
This true even if hash is e.g. SHA1
Solution: build family of functions, choose one that works well
Set of all functions?
Idea: choose “function” that stores items in sorted order without collisions
problem: to evaluate function, must examine data
evaluation time \(\Omega(\log n)\).
“description size” \(\Omega(n \log m)\),
Better goal: choose function without looking at data (except query key)
How about a random function?
set \(N\) of \(n\) items
If \(s=n\), balls in bins
\(O((\log n)/(\log\log n))\) collisions w.h.p.
And matches that somewhere
but we may care more about average collisions over many operations
\(C_{ij}=1\) if \(i,j\) collide
Time to find \(i\) is \(1+\sum_j C_{ij}\)
expected value \(1+(n-1)/n \le 1\)
more generally expected search time for item (present or not): \(O(n/s)=O(1)\) if \(s=n\)
Problem:
\(n^m\) functions (specify one of \(n\) places for each of \(n\) items)
too much space to specify (\(m\log n\)),
hard to evaluate
for \(O(1)\) search time, want to evaluate function in \(O(1)\) time.
so function description must fit in \(O(1)\) machine words
Assuming \(\log m\) bit words
So, fixed number of cells can only distinguish poly\((m)\) functions
This bounds size of hash family we can choose from
2-universal family: [Carter-Wegman]
Key insight: don’t need entirely random function
All we care about is which pairs of items collide
so: OK if items land pairwise independent
pick prime (or prime power) \(p\) in range \(m,\ldots,2m\) (not random)
pick random \(a,b\)
map \(x\) to \((ax + b \bmod p)\bmod m\)
fix \(x\), \(y\)
mapping pairwise independent, uniform before \(\mod m\) \[\begin{aligned} ax+1\cdot b &= s\\ ay+1\cdot b &= t\\ \end{aligned}\] matrix \(\binom{x\quad 1}{y \quad 1}\) determinant \(x-y \ne 0\), so unique solution
So pairwise independent, near-uniform after \(\mod m\)
at most 2 “uniform buckets” to same place
argument above holds: \(O(1)\) expected search time.
represent with two \(O(\log m)\)-bit integers: hash family of poly size.
max load may be large as \(\sqrt{n}\), but who cares?
expected load in a bin is 1
so \(O(\sqrt{n})\) with prob. 1-1/n (chebyshev).
this bounds expected max-load
some item may have bad load, but unlikely to be the requested one
can show the max load is probably achieved for some 2-universal families
perfect hash families
Fredman Komlos Szemeredi.
Ideally, would hash with no collisions
Get worst case \(O(1)\) lookups
Explore case of fixed set of \(n\) items (read only)
perfect hash function: no collisions
Even fully random function of \(n\) to \(n\) has collisions
Alternative try: use more space:
How small can \(s\) be for random \(n\) to \(s\) without collisions?
Expected number of collisions is \(E[\sum C_{ij}] = \binom{n}{2}(1/s) \approx n^2/2s\)
Markov Inequality: \(n=\sqrt{s}\) works with prob. 1/2
Nonzero probability, so, 2-universal hashes can work in quadratic space.
Is this best possible?
Birthday problem: \((1-1/s)\cdots(1-n/s) \approx e^{-(1/s+2/s+\cdots+n/s)} \approx e^{-n^2/2s}\)
So, when \(n=\sqrt{s}\) has \(\Omega(1)\) chance of collision
23 for birthdays
even for fully independent
Finding one
We know one exists—how find it?
Try till succeed
Each time, succeed with probability 1/2
Expected number of tries to succeed is 2
so expect \(O(n)\) work
Probability need \(k\) tries is \(2^{-k}\)
Two level hashing for linear space
Hash \(n\) items into \(O(n)\) space 2-universally
Build quadratic size hash table on contents of each bucket
bound \(\sum b_k^2=\sum_k (\sum_i [i \in b_k])^2 =\sum C_i + C_{ij}\)
expected value \(O(n)\).
So try till get (markov)
Then build collision-free quadratic tables inside
Try till get
Polynomial time in \(n\), Las-vegas algorithm
Easy: \(6n\) cells
Hard: \(n+o(n)\) cells (bit fiddling)
Define las vegas, compare to monte carlo.
Derandomization
Probability \(1/2\) top-level function works
Only \(m^2\) top-level functions
Try them all!
Polynomial in \(m\) (not \(n\)), deterministic algorithm
Followups
Dynamic Perfect Hashing