# Hashing

Everything you need to know about probability

Linearity of expectation

Indicator variables

Independent events

Product rule

Markov inequality

## Hashing

Dictionaries

Operations.

makeset, insert, delete, find

Model

keys are integers in \(M=\set{1,\ldots,m}\)

(so assume machine word size, or “unit time,” is \(\log m\))

can store in array of size \(M\)

using power: arithmetic, indirect addressing

(more than bucket based heaps that use indirection without arithmetic)

compare to comparison and pointer based sorting, binary trees

problem: space.

Hashing:

find function \(h\) mapping \(M\) into table of size \(s \ll m\)

use it to put \(n\) items from \(M\) in the table

so presumably \(s \ge n\)

Still, some items can get mapped to same place: “collision”

use linked list etc.

search, insert cost equals size of linked list

goal: keep linked lists small: few collisions

Hash families:

problem: for any hash function, some bad input (if space \(s\), then \(m/s\) items to same bucket)

This true even if hash is e.g. SHA1

Solution: build family of functions, choose one that works well

Set of all functions?

Idea: choose “function” that stores items in sorted order without collisions

problem: to evaluate function, must examine data

evaluation time \(\Omega(\log n)\).

“description size” \(\Omega(n \log m)\),

Better goal: choose function without looking at data (except query key)

How about a random function?

set \(N\) of \(n\) items

If \(s=n\), balls in bins

\(O((\log n)/(\log\log n))\) collisions w.h.p.

And matches that somewhere

but we may care more about

*average*collisions over many operations\(C_{ij}=1\) if \(i,j\) collide

Time to find \(i\) is \(1+\sum_j C_{ij}\)

expected value \(1+(n-1)/n \le 1\)

more generally expected search time for item (present or not): \(O(n/s)=O(1)\) if \(s=n\)

Problem:

\(n^m\) functions (specify one of \(n\) places for each of \(n\) items)

too much space to specify (\(m\log n\)),

hard to evaluate

for \(O(1)\) search time, want to evaluate function in \(O(1)\) time.

so function description must fit in \(O(1)\) machine words

Assuming \(\log m\) bit words

So, fixed number of cells can only distinguish poly\((m)\) functions

This bounds size of hash family we can choose from

2-universal family: [Carter-Wegman]

Key insight: don’t need entirely random function

All we care about is which pairs of items collide

so: OK if items land

*pairwise independent*pick prime (or prime power) \(p\) in range \(m,\ldots,2m\) (not random)

pick random \(a,b\)

map \(x\) to \((ax + b \bmod p)\bmod m\)

fix \(x\), \(y\)

mapping pairwise independent, uniform before \(\mod m\) \[\begin{aligned} ax+1\cdot b &= s\\ ay+1\cdot b &= t\\ \end{aligned}\] matrix \(\binom{x\quad 1}{y \quad 1}\) determinant \(x-y \ne 0\), so unique solution

So pairwise independent, near-uniform after \(\mod m\)

at most 2 “uniform buckets” to same place

argument above holds: \(O(1)\) expected search time.

represent with two \(O(\log m)\)-bit integers: hash family of poly size.

*max*load may be large as \(\sqrt{n}\), but who cares?expected load in a bin is 1

so \(O(\sqrt{n})\) with prob. 1-1/n (chebyshev).

this bounds expected max-load

some item may have bad load, but unlikely to be the requested one

can show the max load is probably achieved for some 2-universal families

## perfect hash families

Fredman Komlos Szemeredi.

Ideally, would hash with *no* collisions

Get

*worst case*\(O(1)\) lookupsExplore case of fixed set of \(n\) items (read only)

perfect hash function: no collisions

Even fully random function of \(n\) to \(n\) has collisions

Alternative try: use more space:

How small can \(s\) be for random \(n\) to \(s\) without collisions?

Expected number of collisions is \(E[\sum C_{ij}] = \binom{n}{2}(1/s) \approx n^2/2s\)

**Markov Inequality:**\(n=\sqrt{s}\) works with prob. 1/2Nonzero probability, so, 2-universal hashes can work in quadratic space.

Is this best possible?

Birthday problem: \((1-1/s)\cdots(1-n/s) \approx e^{-(1/s+2/s+\cdots+n/s)} \approx e^{-n^2/2s}\)

So, when \(n=\sqrt{s}\) has \(\Omega(1)\) chance of collision

23 for birthdays

even for fully independent

Finding one

We know one exists—how find it?

Try till succeed

Each time, succeed with probability 1/2

Expected number of tries to succeed is 2

so expect \(O(n)\) work

Probability need \(k\) tries is \(2^{-k}\)

Two level hashing for linear space

Hash \(n\) items into \(O(n)\) space 2-universally

Build quadratic size hash table on contents of each bucket

bound \(\sum b_k^2=\sum_k (\sum_i [i \in b_k])^2 =\sum C_i + C_{ij}\)

expected value \(O(n)\).

So try till get (markov)

Then build collision-free quadratic tables inside

Try till get

Polynomial time in \(n\), Las-vegas algorithm

Easy: \(6n\) cells

Hard: \(n+o(n)\) cells (bit fiddling)

Define las vegas, compare to monte carlo.

Derandomization

Probability \(1/2\) top-level function works

Only \(m^2\) top-level functions

Try them all!

Polynomial in \(m\) (not \(n\)), deterministic algorithm

Followups

Dynamic Perfect Hashing