Purpose: maintain a (possibly evolving) set of stuff.
Using a “key”: supported operations are Insert, Delete, and Lookup (all by key).
Amazing guarantee: all operations in $\mathcal{O}(1)$ time if implemented properly (good hash function, appropriate number of buckets) and for non-pathological data.
Application: 2-SUM problem. Input: unsorted array $A$ of $n$ integers and a target $t$; goal: determine whether or not there are two numbers $x, y \in A$ with $x + y = t$.
Naïve solution: $\Theta(n^2)$ time via brute-force
Better: sort $A$, then for each $x \in A$ binary-search for $t - x$ - $\Theta(n\log{n})$ time.
Amazing: insert the elements of $A$ into a hash table, then for each $x \in A$ look up $t - x$ - $\Theta(n)$ time (assuming $\mathcal{O}(1)$-time hash table operations).
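The linear-time approach can be sketched as follows (a minimal Python sketch; the function name is illustrative, and it assumes the pair must occupy two distinct array positions):

```python
def two_sum(A, t):
    """Return True if some pair of distinct positions x, y in A has x + y = t."""
    seen = set()           # hash-based set: O(1) expected insert/lookup
    for x in A:
        if t - x in seen:  # a matching partner was seen earlier
            return True
        seen.add(x)
    return False
```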
Setup: universe $\mathcal{U}$ - generally really big
Goal: want to maintain evolving set $S \subseteq \mathcal{U}$ (generally, of reasonable size).
Naïve solutions: (1) array indexed by $\mathcal{U}$ - $\Theta(1)$-time operations but $\Theta(\mid\mathcal{U}\mid)$ space; (2) list of the objects - $\Theta(\mid S\mid)$ space but $\Theta(\mid S\mid)$-time lookup.
Solution: hash table - the best of both: $\mathcal{O}(1)$-time operations (usually) with only $\Theta(\mid S\mid)$ space.
Quiz: consider $n$ people with random birthdays (i.e. with each day of the year equally likely). How large does $n$ need to be before there is at least a 50% chance that two people have the same birthday? Answer: 23
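The quiz's answer follows from $\mathbb{P}[\text{all distinct}] = \prod_{i=0}^{n-1}\frac{365 - i}{365}$, which is easy to check directly (a small Python sketch; the function name is illustrative):

```python
def same_birthday_prob(n, days=365):
    """P(at least two of n people share a birthday), birthdays uniform."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct
```

The probability first exceeds 50% at $n = 23$.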
Collisions: distinct $x, y \in \mathcal{U}$ such that $h(x) = h(y)$.
Solution 1: (separate) chaining
Solution 2: open addressing (only one object per bucket)
Note: in a hash table with chaining, Insert is $\mathcal{O}(1)$ - insert the new object $x$ at the front of the list in $A[h(x)]$. Lookup/Delete take $\mathcal{O}(\text{list length})$, which could be anywhere from $\frac{m}{n}$ (equal-length lists) to $m$ (all objects in the same bucket), for $m$ objects and $n$ buckets.
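The chaining scheme can be sketched as follows (a minimal Python sketch; compressing Python's built-in `hash` modulo the bucket count is purely illustrative, not a recommended hash function):

```python
class ChainedHashTable:
    def __init__(self, n_buckets=997):
        self.buckets = [[] for _ in range(n_buckets)]

    def _h(self, x):
        # illustrative only: Python's hash, compressed to a bucket index
        return hash(x) % len(self.buckets)

    def insert(self, x):
        # O(1): prepend to the list in A[h(x)]
        self.buckets[self._h(x)].insert(0, x)

    def lookup(self, x):
        # O(list length): scan the bucket
        return x in self.buckets[self._h(x)]

    def delete(self, x):
        # O(list length): scan and remove
        bucket = self.buckets[self._h(x)]
        if x in bucket:
            bucket.remove(x)
```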
Point: performance depends on the choice of hash function!
Properties of a good “hash” function:
Example: keys = phone numbers (10-digit), $\mid\mathcal{U}\mid = 10^{10}$. Choose $n = 10^3$ buckets. A terrible hash function: the first 3 digits (area code - badly skewed); a mediocre one: the last 3 digits (better spread, but still vulnerable to patterns in the data).
Objects $\to$ (hash code) $\to$ Integers $\to$ (compression function) $\to$ Buckets
How to choose $n$ - the number of buckets: (1) choose $n$ to be a prime (within a small constant factor of the number of objects stored); (2) not too close to a power of 2; (3) not too close to a power of 10.
Definition: the load factor of a hash table is
$$ \alpha = \frac{\text{number of objects in a hash table}}{\text{number of buckets in the hash table}} $$
Quiz: what hash table implementation strategy is feasible for load factors larger than 1? Answer: Only chaining
Note: for good performance, the load must be controlled - $\alpha = \mathcal{O}(1)$ for chaining, $\alpha \ll 1$ for open addressing.
Upshot: keep an eye on the load and expand the table as needed (e.g. double the number of buckets once $\alpha$ exceeds some threshold).
Idea: use a super-clever hash function guaranteed to spread every data set out evenly.
Problem: does not exist! (for every hash function, there is a pathological data set).
Reason: fix a hash function $h: \mathcal{U}\to \{0, 1, 2, \ldots, n - 1\}$. By the pigeonhole principle, there is a bucket $i$ such that at least $\frac{\mid \mathcal{U}\mid}{n}$ elements of $\mathcal{U}$ hash to $i$ under $h$.
Solutions: (1) use a cryptographic hash function - pathological data sets still exist but are infeasible to reverse-engineer; (2) use randomization - design a family $\mathcal{H}$ of hash functions such that, for every data set, almost all $h \in \mathcal{H}$ spread it out evenly (universal hashing).
Definition: let $\mathcal{H}$ be a set of hash functions from $\mathcal{U}$ to $\{0, 1, 2, \ldots, n - 1\}$. $\mathcal{H}$ is universal if and only if:
$$ \forall x, y \in \mathcal{U}, x \neq y: \mathbb{P}_{h \in \mathcal{H}}[\text{$x, y$ collide (i.e. $h(x) = h(y)$)}] \leq \frac{1}{n} $$ where $n$ is the number of buckets and $h$ is chosen uniformly at random from $\mathcal{H}$ (i.e. the collision probability is as small as with the “gold standard” of perfectly random hashing).
Quiz: Consider a hash function family $\mathcal{H}$, where each hash function of $\mathcal{H}$ maps elements from a universe $\mathcal{U}$ to one of $n$ buckets. Suppose $\mathcal{H}$ has the following property: for every bucket $i$ and key $k$, a $\frac{1}{n}$ fraction of the hash functions of $\mathcal{H}$ map $k$ to $i$. Is $\mathcal{H}$ universal? Answer: maybe yes, maybe no, depending on $\mathcal{H}$ - the stated property constrains where each key lands individually, but says nothing about the probability that a given pair of keys collides.
Let $\mathcal{U}$ = IP addresses of the form $(x_1, x_2, x_3, x_4)$ with $x_i \in \{0, 1, 2, \ldots, 255\}$. Let $n$ be a prime (e.g. a small multiple of the number of objects in the hash table).
Construction: define one hash function $h_a$ per 4-tuple $a = (a_1, a_2, a_3, a_4)$ with each $a_i \in \{0, 1, \ldots, n - 1\}$ - $n^4$ such functions.
Define: $h_a$: IP addresses $\to$ buckets by $h_a(x_1, x_2, x_3, x_4) = (a_1x_1 + a_2x_2 + a_3x_3 + a_4x_4)\pmod n$
Define: $\mathcal{H} = \{h_a \mid a_1, a_2, a_3, a_4 \in \{0, 1, \ldots, n - 1\}\}$
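The construction translates directly into code (Python sketch; `make_hash` is an illustrative name, and the prime $n$ should exceed 255, e.g. $n = 997$):

```python
import random

def make_hash(n, rng=random):
    """Draw h_a uniformly from the family H by picking a = (a1, a2, a3, a4)."""
    a = [rng.randrange(n) for _ in range(4)]
    def h(ip):
        # ip is a 4-tuple (x1, x2, x3, x4), each coordinate in 0..255
        return sum(ai * xi for ai, xi in zip(a, ip)) % n
    return h
```

Each call to `make_hash` corresponds to choosing one member of $\mathcal{H}$ uniformly at random.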
Theorem: this family is universal.
Consider distinct IP addresses $(x_1, x_2, x_3, x_4)$ and $(y_1, y_2, y_3, y_4)$. Assume that $x_4\neq y_4$. What is the collision probability, i.e. $\mathbb{P}_{h_a \in \mathcal{H}}[h_a(x_1, x_2, x_3, x_4) = h_a(y_1, y_2, y_3, y_4)]$?
Note: collision
$$ \begin{aligned} &\iff a_1x_1 + a_2x_2 + a_3x_3 + a_4x_4 \equiv a_1y_1 + a_2y_2 + a_3y_3 + a_4y_4 \pmod n \\ &\iff a_4(x_4 - y_4) \equiv \sum_{i = 1}^3 a_i(y_i - x_i) \pmod n \end{aligned} $$ Next: condition on the random choices of $a_1, a_2, a_3$ (with $a_4$ still random).
Key claim: the left-hand side is equally likely to be any of $\{0, 1, 2, \ldots, n - 1\}$. Reason: $x_4 \neq y_4$ implies $(x_4 - y_4) \not\equiv 0 \pmod n$, $n$ is prime, and $a_4$ is chosen uniformly at random.
Addendum: make sure $n$ is bigger than the maximum value of a coordinate $x_i$ (here, 255), so that $x_4 - y_4 \not\equiv 0 \pmod n$.
Proof by example: let $n = 7$ and $x_4-y_4 \equiv 2$ or $3 \pmod n$. As $a_4$ ranges over $\{0, 1, \ldots, 6\}$, the product $a_4(x_4 - y_4) \bmod 7$ takes each value in $\{0, 1, \ldots, 6\}$ exactly once.
Hence, this implies
$$\mathbb{P}_{h_a \in \mathcal{H}}[h_a(x) = h_a(y)] = \frac{1}{n}$$
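This $\frac{1}{n}$ collision probability can be sanity-checked empirically by sampling many random members of $\mathcal{H}$ and counting collisions for a fixed pair with $x_4 \neq y_4$ (a rough simulation, not a proof; function name and parameters are illustrative):

```python
import random

def estimate_collision_prob(x, y, n, trials=20000, seed=0):
    """Fraction of randomly drawn h_a under which x and y collide."""
    rng = random.Random(seed)
    collisions = 0
    for _ in range(trials):
        a = [rng.randrange(n) for _ in range(4)]  # one random h_a
        hx = sum(ai * xi for ai, xi in zip(a, x)) % n
        hy = sum(ai * yi for ai, yi in zip(a, y)) % n
        collisions += (hx == hy)
    return collisions / trials
```

With $n = 257$ (a prime larger than 255) the estimate should hover around $\frac{1}{257} \approx 0.0039$.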
Scenario: hash table implemented with chaining. Hash function $h$ chosen uniformly random from universal family $\mathcal{H}$.
Theorem (Carter-Wegman 1979): all operations run in $\mathcal{O}(1)$ time.
Caveats: (1) $\mathcal{O}(1)$ in expectation over the random choice of $h$; (2) assumes $\mid S\mid = \mathcal{O}(n)$, i.e. load $\alpha = \mathcal{O}(1)$; (3) assumes evaluating $h$ takes $\mathcal{O}(1)$ time.
Will analyze an unsuccessful lookup (other operations are only faster).
Let $S$ be the data set with $\mid S\mid \in \mathcal{O}(n)$. Consider lookup for $x \notin S$
Running time: $\mathcal{O}(1)$ (compute $h(x)$) + $\mathcal{O}(\text{list length in } A[h(x)])$ (traverse the list, whose length is a random variable depending on the choice of $h$).
Let $L$ be the list length in $A[h(x)]$. For $y \in S$ (so $x \neq y$), define
$$ Z_y = \begin{cases} 1 &\text{if } h(y) = h(x) \\ 0 &\text{otherwise} \end{cases} $$ Note: $$L = \sum_{y \in S}Z_y$$ Hence, by linearity of expectation, $$\mathbb{E}[L] = \sum_{y \in S} \mathbb{E}[Z_y]$$ and $$\mathbb{E}[Z_y] = \mathbb{P}[h(y) = h(x)]$$ Thus, $$\mathbb{E}[L] = \sum_{y \in S} \mathbb{E}[Z_y]=\sum_{y \in S}\mathbb{P}[h(y)=h(x)]\leq \sum_{y \in S} \frac{1}{n}=\frac{\mid S\mid}{n} = \alpha$$ where the inequality uses universality of $\mathcal{H}$, and the load $\alpha \in \mathcal{O}(1)$.
Recall: one object per slot. The hash function produces a probe sequence for each possible key $x$.
Fact: difficult to analyze rigorously.
Heuristic assumption (for a quick and dirty idealized analysis only): all $n!$ probe sequences equally likely.
Observation: under the heuristic assumption, the expected insertion time is $\approx \frac{1}{1 - \alpha}$, where $\alpha$ is the load.
Proof: a random probe finds an empty slot with probability $1 - \alpha$, so the insertion time approximates the number $N$ of coin flips needed to get “heads”, where $\mathbb{P}[\text{“heads”}] = 1 - \alpha$.
Note: $$\mathbb{E}[N] = 1 + \alpha\mathbb{E}[N]$$ (one flip, plus a fresh attempt with probability $\alpha$). Therefore, $$\mathbb{E}[N] = \frac{1}{1 - \alpha}$$
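Under the heuristic assumption the probe count $N$ is geometric, and a quick simulation (illustrative only; names and trial counts are arbitrary) agrees with $\frac{1}{1-\alpha}$:

```python
import random

def mean_probes(alpha, trials=100000, seed=1):
    """Average number of probes until an empty slot, where each probe
    independently hits an empty slot with probability 1 - alpha."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        probes = 1
        while rng.random() < alpha:  # probe hit an occupied slot
            probes += 1
        total += probes
    return total / trials
```

For example, $\alpha = 0.5$ should give about $2$ probes and $\alpha = 0.75$ about $4$.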
Note: heuristic assumption completely false.
Assume instead: initial probe uniformly random, independent for different keys.
Theorem (Knuth 1962): under the above assumption, the expected Insertion time is approximately $\frac{1}{(1 - \alpha)^2}$, where $\alpha$ is the load factor.
Raison d’être: fast Inserts and Lookups
Comparisons to the Hash Table: Pro: more space-efficient. Cons: can't store associated objects, no deletions (in the vanilla version), and a small false-positive probability.
Ingredients: (1) an array $A$ of $n$ bits; (2) $k$ hash functions $h_1, h_2, \ldots, h_k$ ($k$ a small constant).
Insert:
for i = 1, 2, .., k
set A[h_i(x)] = 1 (whether or not the bit is already set to 1)
Lookup:
return true if A[h_i(x)] = 1 for every i = 1 .. k
Note: no false negatives - if $x$ was inserted, $lookup(x)$ is guaranteed to succeed.
But: false positives are possible - $lookup(x)$ can return true for an $x$ that was never inserted, if all $k$ bits $A[h_i(x)]$ have already been set to 1 by other insertions.
Intuition: should be a trade-off between space and error (false positive) probability.
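The Insert/Lookup pseudocode above can be fleshed out as follows (Python sketch; deriving the $k$ hash functions from salted SHA-256 is an illustrative choice, not prescribed by the notes):

```python
import hashlib

class BloomFilter:
    """Bloom filter with n bits and k hash functions; the h_i are
    simulated by hashing '<salt>:<key>' with SHA-256 (illustrative)."""

    def __init__(self, n_bits, k):
        self.n = n_bits
        self.k = k
        self.bits = [0] * n_bits

    def _indices(self, x):
        # one bit index per hash function h_1 .. h_k
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def insert(self, x):
        for idx in self._indices(x):
            self.bits[idx] = 1  # set, whether or not already set

    def lookup(self, x):
        # True can be a false positive; False is always correct
        return all(self.bits[idx] for idx in self._indices(x))
```

Note that `insert` never stores $x$ itself, which is why deletions and associated objects are not supported.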
Assume (not justified): all $h_i(x)$’s uniformly random and independent (across different $i$ and $x$).
Setup: $n$ bits, insert data set $\mathcal{S}$ into bloom filter.
Quiz: Under the heuristic assumption, what is the probability that a given bit of the bloom filter (the first bit, say) has been set to 1 after the data set $\mathcal{S}$ has been inserted? Answer: $1 - (1 - \frac{1}{n})^{k\mid \mathcal{S}\mid}$
Also, probability that the given bit is 0 is $(1 - \frac{1}{n})^{k\mid \mathcal{S}\mid}$
Recall: $1 - (1 - \frac{1}{n})^{k\mid \mathcal{S}\mid} \approx 1 - e^{-\frac{k\mid\mathcal{S}\mid}{n}} = 1 - e^{-\frac{k}{b}}$, where $b = \frac{n}{\mid\mathcal{S}\mid}$ is the number of bits per object.
Under the assumption, for $x \notin \mathcal{S}$, the false-positive probability is $\approx (1 - e^{-\frac{k}{b}})^k$. Let $\epsilon = (1 - e^{-\frac{k}{b}})^k$ denote the error rate.
How to set $k$? For fixed $b$, $\epsilon$ is minimized by setting $k \approx \ln{2}\cdot b$. Plugging back in,
$$ \epsilon \approx \left(\frac{1}{2}\right)^{b\ln{2}} $$
or
$$ b \approx 1.44\log_2{\frac{1}{\epsilon}} $$
Example: with $b = 8$, choose $k = 5$ or $k = 6$; error probability $\approx 2\%$.
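The heuristic error rate $\epsilon = (1 - e^{-k/b})^k$ is easy to evaluate numerically, reproducing the $b = 8$ example (illustrative sketch):

```python
import math

def error_rate(b, k):
    """Heuristic bloom-filter false-positive rate (1 - e^{-k/b})^k,
    for b bits per object and k hash functions."""
    return (1 - math.exp(-k / b)) ** k
```

For $b = 8$, both $k = 5$ and $k = 6$ land near 2%, and doubling to $b = 16$ (with $k \approx 11$) drives the rate well below 0.1%.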
Which of the following is not a property that you expect a well-designed hash function to have?
Answer: The hash function should “spread out” every data set (across the buckets/slots of the hash table).
Suppose we use a hash function $h$ to hash $n$ distinct keys into an array $T$ of length $m$. Assuming simple uniform hashing — that is, with each key mapped independently and uniformly to a random bucket — what is the expected number of keys that get mapped to the first bucket? More precisely, what is the expected cardinality of the set $\{k : h(k) = 1\}$?
Answer: $$\frac{n}{m}$$
Suppose we use a hash function $h$ to hash $n$ distinct keys into an array $T$ of length $m$. Say that two distinct keys $x,y$ collide under $h$ if $h(x)=h(y)$. Assuming simple uniform hashing — that is, with each key mapped independently and uniformly to a random bucket — what is the probability that a given pair $x,y$ of distinct keys collide?
Answer: $$\frac{1}{m}$$
Suppose we use a hash function $h$ to hash $n$ distinct keys into an array $T$ of length $m$. Assuming simple uniform hashing — that is, with each key mapped independently and uniformly to a random bucket — what is the expected number of pairs of distinct keys that collide? (As above, distinct keys $x,y$ are said to collide if $h(x)=h(y)$.)
Answer: $$\frac{n(n - 1)}{2m}$$
To interpret our heuristic analysis of bloom filters in lecture, we considered the case where we were willing to use 8 bits of space per object in the bloom filter. Suppose we were willing to use twice as much space (16 bits per object). What can you say about the corresponding false positive rate, according to our heuristic analysis (assuming that the number $k$ of hash functions is set optimally)? [Choose the strongest true statement.]
$$ \left(\frac{1}{2}\right)^{\ln{2}\times 16} \approx 0.0005 $$
Answer: less than 1%