Though hash table is one of the most efficient data structures, it has some weaknesses.

For any hash function $\mathbf{h}$, a set of keys exist that can cause the average access time of a hash table to skyrocket. I.e. all keys from $\{k\in\mathcal{U} : \mathbf{h}(k) = i\}$ for some slot $i$.

Approach:

Choose the hash function at random, independently of the keys.

Universal Hashing

Definition

Let $\mathcal{U}$ be a universe of keys, and let $\mathcal{H}$ be a finite collection of hash functions, each mapping $\mathcal{U}$ to $\{0, 1, \cdots, m - 1\}$. We say $\mathcal{H}$ is universal if $\forall x, y \in \mathcal{U}$, where $x \neq y$, we have $\vert \{\mathbf{h}\in\mathcal{H} : \mathbf{h}(x) = \mathbf{h}(y)\} \vert = \frac{\vert\mathcal{H}\vert}{m}$.
That is, the probability of a collision between $x$ and $y$ is $\frac{1}{m}$ if we choose hash function $\mathbf{h}$ randomly from $\mathcal{H}$.

Theorem

Let $\mathbf{h}$ be a hash function chosen uniformly from a universal set $\mathcal{H}$ of hash functions. Suppose $\mathbf{h}$ is used to hash $n$ arbitrary keys into the $m$ slots of a table $T$. Then, for a given key $x$, we have
\[E[\#\text{collisions with x}] < \frac{n}{m}\]

Proof

Let random variable $C_x$ = the total number of collisions of keys in $T$ with $x$. And let indicator r.v. $c_{xy}$ denotes whether $\mathbf{h}(x) = \mathbf{h}(y)$.
\[
c_{xy} =
\begin{cases}
1\text{ if }\mathbf{h}(x) = \mathbf{h}(y),\newline
0\text{ otherwise}
\end{cases}
\]
And it is easy to get:
\[
\begin{align*}
E[c_{xy}] &= \frac{1}{m}\newline
C_x &= \sum_{y\in T - \{x\}}c_{xy}
\end{align*}
\]
So,
\[
\begin{align*}
E[C_x] &= E\left[\sum_{y\in T - \{x\}}c_{xy}\right]\newline
&= \sum_{y\in T - \{x\}}E\left[c_{xy}\right]\newline
&= \sum_{y\in T - \{x\}}\frac{1}{m}\newline
&= \frac{n - 1}{m}\newline
&< \frac{n}{m}
\end{align*}
\]

Constructing universal hash functions

Let $m$ be prime. Decompose key $k$ into $r + 1$ digits, each with value in the set $\{0, 1, \cdots, m - 1\}$. That is, let $k = \langle k_0, k_1, \cdots, k_r\rangle$, where $0 \leq k_i < m$.
Pick $a = \langle a_0, a_1, \cdots, a_r\rangle$ where each $a_i$ is chosen randomly from $\{0, 1, \cdots, m - 1\}$.
Define $\mathbf{h}_a(k) = \sum_{i = 0}^ra_ik_i\mod m$. And $\mathcal{H} = \{\mathbf{h}_a\}$, $\vert\mathcal{H}\vert = m^{r + 1}$.

Theorem

The set $\mathcal{H} = \{\mathbf{h}_a\}$ is universal.

Proof

Suppose that $x = \langle x_0, x_1, \cdots, x_r\rangle$, and $y = \langle y_0, y_1, \cdots, y_r\rangle$ be distinct keys. Thus, they differ in at least one digit position, without loss of generality, position 0. $x$ and $y$ collide, so we must have $\mathbf{h}_a(x) = \mathbf{h}_a(y)$, which implies that
\[\sum_{i = 0}^ra_ix_i \equiv \sum_{i = 0}^ra_iy_i ~(\mathrm{mod}~m)\]
Equivalently, we have
\[\sum_{i = 0}^ra_i(x_i - y_i)\equiv 0 ~(\mathrm{mod}~m)\]
\[a_0(x_0 - y_0) \equiv -\sum_{i = 1}^ra_i(x_i - y_i) ~(\mathrm{mod}~m)\]

Theorem

Let $m$ be prime. $\forall z\in\mathcal{Z}_m$ such that $z\neq 0$, there exists a unique $z^{-1}\in\mathcal{Z}_m$ such that
\[z\cdot z^{-1}\equiv 1 ~(\mathrm{mod}~m)\]
Then back to the proof, we have
\[a_0\equiv\left(-\sum_{i = 1}^ra_i(x_i - y_i)\right)\cdot(x_0 - y_0)^{-1} ~(\mathrm{mod}~m)\]
Thus, for any choices of $a_1, a_2, \cdots, a_r$, exactly one choice of $a_0$ causes $x$ and $y$ to collide.
So the number of $\mathbf{h}_a$’s that cause $x$ and $y$ to collide is $m^r\cdot 1 = m^r = \frac{\vert\mathcal{H}\vert}{m}$.

Perfect Hashing

Given a set of $n$ keys, construct a static hash table of size $m = O(n)$ such that SEARCH takes $\Theta(1)$ time in the worst case.

Approach:

Two-level scheme with universal hashing at both levels.

Perfect Hashing

Theorem

Let $\mathcal{H}$ be a class of universal hash functions for a table of size $m = n^2$. Then, we use a random $\mathbf{h} \in \mathcal{H}$ to hash $n$ keys into the table, the expected number of collisions is at most 1/2.

Proof

By the definition of universality, the probability that 2 given keys in the table collide under $\mathbf{h}$ is $1/m = 1/n^2$. Since there are $\binom{n}{2}$ pairs of keys that can possibly collide, the expected number of collision is
\[\binom{n}{2}\cdot\frac{1}{n^2} = \frac{n(n - 1)}{2}\cdot\frac{1}{n^2} < \frac{1}{2}\]

Corollary

The probability of no collisions is at least 1/2.

Proof

Markov’s inequality says that for any nonnegative r.v. $X$, we have
\[P\{X\geq t\}\leq\frac{E[X]}{t}\]
Applying this inequality with t = 1, we find that the probability of 1 or more collisions is at most 1/2.

Analysis of storage

For the level-1 hash table $T$, choose $m = n $, and let $n_i$ be r.v. for the number of keys that hash to slot $i$ in $T$. By using $n_i^2$ slots for the level-2 hash table $S_i$, the expected total storage required for the two-level scheme is therefore
\[E\left[\sum_{i = 0}^{m - 1}\Theta(n_i^2)\right] = \Theta(n)\]

Proof

The neat trick is that one way to count this quantity is to count the number of ordered pairs that collide, including an element colliding with itself. So, we have:
\[
\begin{align*}
\sum_{i = 0}^{m - 1}n_i^2 &= \sum_x\sum_yc_{xy} \text{ ($c_{xy} = 1$ if x and y collide, else $c_{xy} = 0$)}\newline
&= n + \sum_x\sum_{y \neq x}c_{xy}\newline
&\leq n + n(n - 1)/m\newline
&< 2n
\end{align*}
\]