Coursera

Hash Table

Purpose: maintain a (possibly evolving) set of stuff.

Operations (each using a “key”): Insert, Delete, Lookup.

Amazing guarantee: all operations in $\mathcal{O}(1)$ time if

  1. properly implemented
  2. non-pathological data

Application: 2-SUM problem. Input: unsorted array $A$ of $n$ integers and a target sum $t$. Goal: determine whether or not there are two numbers $x, y \in A$ with $x + y = t$

Naïve solution: $\Theta(n^2)$ time via brute-force

Better:

  1. Sort $A$ - $\Theta(n\log{n})$ time
  2. For each $x \in A$, look for $t - x$ in $A$ via binary search - $\mathcal{O}(\log{n})$ per lookup, so $\Theta(n\log{n})$ total.

Amazing:

  1. Insert elements of $A$ into hash table $H$
  2. For each $x$ in $A$, look up $t - x$ in $H$ - $\mathcal{O}(1)$ expected per lookup; see the sketch below.
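
A minimal Python sketch of the hash-based approach (a toy illustration under the lecture's assumptions; names are illustrative):

    def two_sum(A, t):
        """Return True if some x, y in A satisfy x + y == t.
        Treats x == y as allowed; require distinct indices if needed."""
        H = set(A)                            # 1. insert elements of A into hash table H
        return any((t - x) in H for x in A)   # 2. look up t - x for each x

Assuming $\mathcal{O}(1)$ hash table operations, this runs in $\mathcal{O}(n)$ expected time.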

Implementations

Setup: universe $\mathcal{U}$ - generally really big

Goal: want to maintain evolving set $S \subseteq \mathcal{U}$ (generally, of reasonable size).

Naïve solutions:

  1. Array-based solution (indexed by $u \in \mathcal{U}$) - $\mathcal{O}(1)$ time for operations but $\Theta(\mid \mathcal{U}\mid)$ space.
  2. List-based solution - $\mathcal{O}(\mid S\mid)$ space but $\Theta(\mid S\mid)$ lookup time.

Solution:

  1. Pick $n$, the number of “buckets”, with $n \approx \mid S\mid$
  2. Choose a hash function $h: \mathcal{U}\to \{0, 1, 2, \dots, n - 1\}$
  3. Use array $A$ of length $n$, store $x$ in $A[h(x)]$

Quiz: consider $n$ people with random birthdays (i.e. with each day of the year equally likely). How large does $n$ need to be before there is at least a 50% chance that two people have the same birthday? Answer: 23
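
A quick check of the answer: the probability that $n$ people all have distinct birthdays is

$$ \prod_{i=1}^{n-1}\left(1 - \frac{i}{365}\right) $$

which first drops below $\frac{1}{2}$ at $n = 23$ (where it is $\approx 0.493$).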

Resolving Collisions

Collisions: distinct $x, y \in \mathcal{U}$ such that $h(x) = h(y)$.

Solution 1: (separate) chaining

Solution 2: open addressing (only one object per bucket)

What makes a good hash function?

Note: in a hash table with chaining, Insert is $\mathcal{O}(1)$ - insert the new object $x$ at the front of the list in $A[h(x)]$. Lookup/Delete take $\mathcal{O}(\text{list length})$, which for $m$ objects could be anywhere from $\frac{m}{n}$ (equal-length lists) to $m$ (all objects in the same bucket).
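
A minimal sketch of a chained hash table in Python (a toy illustration, not the lecture's code; Python's built-in `hash` reduced mod $n$ stands in for the hash function):

    class ChainedHashTable:
        def __init__(self, n):
            self.n = n                                # number of buckets
            self.buckets = [[] for _ in range(n)]     # one chain per bucket

        def _h(self, x):
            return hash(x) % self.n                   # compress to a bucket index

        def insert(self, x):
            self.buckets[self._h(x)].append(x)        # O(1) amortized

        def lookup(self, x):
            return x in self.buckets[self._h(x)]      # O(chain length)

        def delete(self, x):
            chain = self.buckets[self._h(x)]
            if x in chain:
                chain.remove(x)                       # O(chain length)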

Point: performance depends on the choice of hash function!

Properties of a good “hash” function:

  1. Should lead to good performance i.e. should “spread data out” (gold standard: completely random hashing)
  2. Should be easy to store/very fast to evaluate

Bad hash functions

Example: keys = phone numbers (10-digit), so $\mid\mathcal{U}\mid = 10^{10}$. Choose $n = 10^3$. E.g. $h(x) =$ the first three digits of $x$ (the area code) spreads data out terribly; the last three digits are better, but still not ideal.

Quick-and-dirty hash functions

Objects $\to$ (hash code) $\to$ Integers $\to$ (compression function) $\to$ Buckets (a sketch follows the list below)

How to choose $n$ - number of buckets:

  1. Choose $n$ to be a prime (within constant factor of number of objects in table)
  2. Not too close to a power of 2
  3. Not too close to a power of 10
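
A sketch of the two-stage pipeline above for string keys (the polynomial hash code is a common textbook choice, not prescribed by the lecture; names are illustrative):

    def hash_code(s):
        """Stage 1: object -> integer (polynomial rolling hash, base 31)."""
        code = 0
        for ch in s:
            code = 31 * code + ord(ch)
        return code

    def compress(code, n):
        """Stage 2: integer -> bucket in {0, 1, ..., n - 1}; n should be prime."""
        return code % n

    n = 997                       # a prime, not too close to a power of 2 or 10
    bucket = compress(hash_code("some key"), n)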

The load of a hash table

Definition: the load factor of a hash table is

$$ \alpha = \frac{\text{number of objects in the hash table}}{\text{number of buckets in the hash table}} $$

Quiz: what hash table implementation strategy is feasible for load factors larger than 1? Answer: Only chaining

Note:

  1. $\alpha \in \mathcal{O}(1)$ is a necessary condition for operations to run in constant time.
  2. With open addressing, need $\alpha \ll 1$

Upshot:

  1. for good hash table performance, need to control load.
  2. for good hash table performance, need a good hash function i.e. spreads data evenly across buckets.

Idea: use a super-clever hash function guaranteed to spread every data set out evenly

Problem: does not exist! (for every hash function, there is a pathological data set).

Reason: fix a hash function $h: \mathcal{U}\to \{0, 1, 2, \dots, n - 1\}$. By the pigeonhole principle, there is a bucket $i$ such that at least $\frac{\mid \mathcal{U}\mid}{n}$ elements of $\mathcal{U}$ hash to $i$ under $h$.

Solutions:

  1. Use a cryptographic hash function (e.g. SHA-2) - infeasible to reverse engineer a pathological data set.
  2. Use randomization: design a family $\mathcal{H}$ of hash functions such that, for all data sets $\mathcal{S}$, almost all functions $h \in \mathcal{H}$ spread $\mathcal{S}$ “pretty evenly” (compare to the QuickSort guarantee).

Universal Hashing

Definition: let $\mathcal{H}$ be a set of hash functions from $\mathcal{U}$ to $\{0, 1, 2, \dots, n - 1\}$. $\mathcal{H}$ is universal if and only if:

$$ \forall x, y \in \mathcal{U}, x \neq y: \quad \mathbb{P}_{h \in \mathcal{H}}[\text{$x, y$ collide (i.e. $h(x) = h(y)$)}] \leq \frac{1}{n} $$ where $n$ is the number of buckets and the probability is over $h$ chosen uniformly at random from $\mathcal{H}$ (i.e. the collision probability is as small as with the “gold standard” of perfectly random hashing).

Quiz: Consider a hash function family $\mathcal{H}$, where each hash function of $\mathcal{H}$ maps elements from a universe $\mathcal{U}$ to one of $n$ buckets. Suppose $\mathcal{H}$ has the following property: for every bucket $i$ and key $k$, a $\frac{1}{n}$ fraction of the hash functions of $\mathcal{H}$ map $k$ to $i$. Is $\mathcal{H}$ universal? Answer: maybe yes, maybe no, depending on $\mathcal{H}$.
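
One way to see the “maybe no” case: let $\mathcal{H}$ be the $n$ constant functions $h_i(x) = i$ for $i = 0, 1, \dots, n - 1$. For every key $k$ and bucket $i$, exactly a $\frac{1}{n}$ fraction of the functions map $k$ to $i$, yet every pair of distinct keys collides under every $h_i$:

$$ \mathbb{P}_{h \in \mathcal{H}}[h(x) = h(y)] = 1 > \frac{1}{n} $$

so the stated property holds but $\mathcal{H}$ is not universal.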

Example: hashing IP address

Let $\mathcal{U}$ = IP addresses of the form $(x_1, x_2, x_3, x_4)$ with $x_i \in \{0, 1, 2, \dots, 255\}$. Let $n$ be a prime (e.g. a small multiple of the number of objects in the hash table).

Construction: define one hash function $h_a$ per 4-tuple $a = (a_1, a_2, a_3, a_4)$ with each $a_i \in \{0, 1, \dots, n - 1\}$; there are $n^4$ such functions.

Define $h_a:$ IP addresses $\to$ buckets by $h_a(x_1, x_2, x_3, x_4) = (a_1x_1 + a_2x_2 + a_3x_3 + a_4x_4)\bmod n$
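
A sketch of this construction in Python (drawing $a$ uniformly at random selects a hash function from the family; the value of $n$ here is an illustrative choice):

    import random

    n = 257                       # a prime > 255 (see the addendum in the proof below)

    def random_hash():
        """Pick h_a uniformly by drawing a = (a_1, a_2, a_3, a_4)."""
        a = [random.randrange(n) for _ in range(4)]
        def h(ip):                # ip = (x_1, x_2, x_3, x_4), each in 0..255
            return sum(ai * xi for ai, xi in zip(a, ip)) % n
        return h

    h = random_hash()
    bucket = h((171, 64, 64, 64))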

A universal hash function

Define: $\mathcal{H} = \{h_a \mid a_1, a_2, a_3, a_4 \in \{0, 1, \dots, n - 1\}\}$

Theorem: this family is universal.

Proof

Consider distinct IP addresses $(x_1, x_2, x_3, x_4)$ and $(y_1, y_2, y_3, y_4)$; assume that $x_4 \neq y_4$. What is the collision probability $\mathbb{P}_{h_a \in \mathcal{H}}[h_a(x_1, x_2, x_3, x_4) = h_a(y_1, y_2, y_3, y_4)]$?

Note: collision

$$
\begin{aligned}
&\iff a_1x_1 + a_2x_2 + a_3x_3 + a_4x_4 \equiv a_1y_1 + a_2y_2 + a_3y_3 + a_4y_4 \pmod n \\
&\iff a_4(x_4 - y_4) \equiv \sum_{i = 1}^{3} a_i(y_i - x_i) \pmod n
\end{aligned}
$$

Next: condition on the random choices of $a_1, a_2, a_3$ (with $a_4$ still random).

Key claim: the left-hand side is equally likely to be any of $\{0, 1, 2, \dots, n - 1\}$. Reason: $x_4 \neq y_4 \Rightarrow x_4 - y_4 \not\equiv 0 \pmod n$, $n$ is prime, and $a_4$ is chosen uniformly at random.

Addendum: make sure $n$ is bigger than the maximum value of an $x_i$ (here, 255), so that $x_4 \neq y_4$ indeed implies $x_4 - y_4 \not\equiv 0 \pmod n$.

Example: let $n = 7$ and $x_4 - y_4 \equiv 2 \pmod 7$; as $a_4$ ranges over $\{0, 1, \dots, 6\}$, $a_4 \cdot 2 \bmod 7$ takes the values $0, 2, 4, 6, 1, 3, 5$, i.e. each residue exactly once (similarly for $x_4 - y_4 \equiv 3$).

Hence, conditioned on any choice of $a_1, a_2, a_3$, the probability (over $a_4$) that the two sides are equal is exactly $\frac{1}{n}$; this implies

$$\mathbb{P}_{h_a \in \mathcal{H}}[h_a(x) = h_a(y)] = \frac{1}{n}$$

Analysis of Chaining

Constant-time guarantee

Scenario: hash table implemented with chaining, with the hash function $h$ chosen uniformly at random from a universal family $\mathcal{H}$.

Theorem (Carter-Wegman 1979): all operations run in $\mathcal{O}(1)$ time.

Caveats:

  1. in expectation over random choice of the hash function $h$.
  2. assume $\mid S\mid \in \mathcal{O}(n)$ i.e. load $\alpha = \frac{\mid S\mid}{n} \in \mathcal{O}(1)$, where $n$ is the number of buckets
  3. assume it takes $\mathcal{O}(1)$ time to evaluate the hash function.

Proof

Will analyze an unsuccessful lookup (other operations are only faster).

Let $S$ be the data set with $\mid S\mid \in \mathcal{O}(n)$. Consider a lookup for $x \notin S$.

Running time: $\mathcal{O}(1)$ to compute $h(x)$, plus $\mathcal{O}(\text{list length in } A[h(x)])$ to traverse the list, where the list length is a random variable that depends on the choice of hash function $h$.

Let $L$ be the list length in $A[h(x)]$. For $y \in S$ (so $x \neq y$), define

$$
Z_y = \begin{cases} 1 & \text{if } h(y) = h(x) \\ 0 & \text{otherwise} \end{cases}
$$

Note that $L = \sum_{y \in S} Z_y$ and $\mathbb{E}[Z_y] = \mathbb{P}[h(y) = h(x)]$. Hence, by linearity of expectation,

$$
\mathbb{E}[L] = \sum_{y \in S} \mathbb{E}[Z_y] = \sum_{y \in S} \mathbb{P}[h(y) = h(x)] \leq \sum_{y \in S} \frac{1}{n} = \frac{\mid S\mid}{n}
$$

which is the load $\alpha \in \mathcal{O}(1)$

Hash table performance with Open Addressing

Recall: one object per slot; the hash function produces a probe sequence for each possible key $x$.

Fact: difficult to analyze rigorously.

Heuristic assumption (for a quick and dirty idealized analysis only): all $n!$ probe sequences equally likely.

Heuristic analysis

Observation: under the heuristic assumption, expected insertion time is $\approx \frac{1}{1 - \alpha}$, where $\alpha$ is the load

Proof: a random probe finds an empty slot with probability $1 - \alpha$, so the insertion time is approximately the number $N$ of coin flips needed to get “heads”, where $\mathbb{P}[\text{“heads”}] = 1 - \alpha$

Note: $\mathbb{E}[N] = 1 + \alpha\,\mathbb{E}[N]$ (the first flip always happens; with probability $\alpha$ it is “tails” and we start over). Therefore, $\mathbb{E}[N] = \frac{1}{1 - \alpha}$
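
Equivalently, $N$ is a geometric random variable with success probability $1 - \alpha$, so

$$ \mathbb{E}[N] = \sum_{i=1}^{\infty} i\,(1 - \alpha)\,\alpha^{i-1} = \frac{1}{1 - \alpha} $$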

Linear Probing

Note: heuristic assumption completely false.

Assume instead: initial probe uniformly random, independent for different keys.

Theorem (Knuth 1962): under the above assumption, expected insertion time is approximately $\frac{1}{(1 - \alpha)^2}$, where $\alpha$ is the load factor.

Bloom Filters

Raison d’être: fast Inserts and Lookups

Comparisons to the hash table: more space-efficient, but can't store the objects themselves (or associated data), no deletions (in the basic version), and a small chance of a false positive.

Ingredients:

  1. Array of $n$ bits (so $\frac{n}{\mid S\mid}$ = number of bits per object in data set $S$)
  2. $k$ hash functions $h_1, h_2, .., h_k$ (where $k$ is a small constant)

Insert:

for i = 1, 2, ..., k:
	set A[h_i(x)] := 1    (whether or not the bit is already set to 1)

Lookup:

return true if A[h_i(x)] = 1 for every i = 1, 2, ..., k

Note: no false negatives - if $x$ was inserted, $\text{Lookup}(x)$ is guaranteed to succeed.

But: false positives are possible - if all $k$ bits $h_1(x), \dots, h_k(x)$ were already set to 1 by other insertions, $\text{Lookup}(x)$ returns true even though $x$ was never inserted.
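
A self-contained Python sketch of the above (deriving the $k$ hash functions from two base hashes is a common implementation trick, not something from the lecture; all names are illustrative):

    import hashlib

    class BloomFilter:
        def __init__(self, n, k):
            self.n, self.k = n, k
            self.A = [0] * n                       # array of n bits

        def _hashes(self, x):
            """Simulate h_1(x), ..., h_k(x) via double hashing."""
            d = hashlib.sha256(str(x).encode()).digest()
            h1 = int.from_bytes(d[:8], "big")
            h2 = int.from_bytes(d[8:16], "big")
            return [(h1 + i * h2) % self.n for i in range(self.k)]

        def insert(self, x):
            for j in self._hashes(x):
                self.A[j] = 1                      # set bit, whether or not already 1

        def lookup(self, x):
            return all(self.A[j] == 1 for j in self._hashes(x))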

Heuristic Analysis

Intuition: should be a trade-off between space and error (false positive) probability.

Assume (not justified): all $h_i(x)$’s uniformly random and independent (across different $i$ and $x$).

Setup: $n$ bits, insert data set $\mathcal{S}$ into bloom filter.

Quiz: Under the heuristic assumption, what is the probability that a given bit of the bloom filter (the first bit, say) has been set to 1 after the data set $\mathcal{S}$ has been inserted? Answer: $1 - (1 - \frac{1}{n})^{k\mid \mathcal{S}\mid}$

Also, probability that the given bit is 0 is $(1 - \frac{1}{n})^{k\mid \mathcal{S}\mid}$

Recall that $1 + x \leq e^x$ for all $x$; hence $1 - (1 - \frac{1}{n})^{k\mid \mathcal{S}\mid} \approx 1 - e^{-\frac{k\mid\mathcal{S}\mid}{n}} = 1 - e^{-\frac{k}{b}}$, where $b = \frac{n}{\mid\mathcal{S}\mid}$ is the number of bits per object.

Under the assumption, for $x \notin \mathcal{S}$, the false positive probability is $\approx (1 - e^{-\frac{k}{b}})^k$. Let $\epsilon = (1 - e^{-\frac{k}{b}})^k$ be the error rate.

How to set $k$? For fixed $b$, $\epsilon$ is minimized by setting $k \approx \ln{2}\cdot b$. Plugging back in (each factor becomes $1 - e^{-\ln{2}} = \frac{1}{2}$),

$$ \epsilon \approx \left(\frac{1}{2}\right)^{b\ln{2}} $$

or

$$ b \approx 1.44\log_2{\frac{1}{\epsilon}} $$

Example: with $b = 8$, choose $k = 5$ or $k = 6$; the error probability is $\approx 2\%$ (since $(\frac{1}{2})^{8\ln{2}} \approx 0.02$).

Quiz

Question 1

Which of the following is not a property that you expect a well-designed hash function to have?

Answer: The hash function should “spread out” every data set (across the buckets/slots of the hash table).

Question 2

Suppose we use a hash function $h$ to hash $n$ distinct keys into an array $T$ of length $m$. Assuming simple uniform hashing — that is, with each key mapped independently and uniformly to a random bucket — what is the expected number of keys that get mapped to the first bucket? More precisely, what is the expected cardinality of the set $\{k : h(k) = 1\}$?

Answer: $$\frac{n}{m}$$

Question 3

Suppose we use a hash function $h$ to hash $n$ distinct keys into an array $T$ of length $m$. Say that two distinct keys $x,y$ collide under $h$ if $h(x)=h(y)$. Assuming simple uniform hashing — that is, with each key mapped independently and uniformly to a random bucket — what is the probability that a given pair $x,y$ of distinct keys collide?

Answer: $$\frac{1}{m}$$

Question 4

Suppose we use a hash function $h$ to hash $n$ distinct keys into an array $T$ of length $m$. Assuming simple uniform hashing — that is, with each key mapped independently and uniformly to a random bucket — what is the expected number of pairs of distinct keys that collide? (As above, distinct keys $x,y$ are said to collide if $h(x)=h(y)$.)

Answer: $$\frac{n(n - 1)}{2m}$$
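
This follows from Question 3 by linearity of expectation: each of the $\binom{n}{2}$ pairs of distinct keys collides with probability $\frac{1}{m}$, so

$$ \mathbb{E}[\text{number of colliding pairs}] = \binom{n}{2}\cdot\frac{1}{m} = \frac{n(n - 1)}{2m} $$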

Question 5

To interpret our heuristic analysis of bloom filters in lecture, we considered the case where we were willing to use 8 bits of space per object in the bloom filter. Suppose we were willing to use twice as much space (16 bits per object). What can you say about the corresponding false positive rate, according to our heuristic analysis (assuming that the number $k$ of hash functions is set optimally)? [Choose the strongest true statement.]

$$ \left(\frac{1}{2}\right)^{\ln{2}\times 16} \approx 0.0005 $$

Answer: less than 1%