Coursera

Huffman codes

Binary codes

Binary code: maps each character of an alphabet $\Sigma$ to a binary string.

Example: $\Sigma = a-z$ and various punctuations (size 32 overall, say)

Obvious encoding: use the 32 5-bit binary strings to encode this $\Sigma$ (a fixed-length code)

Can we do better? yes, if some characters of $\Sigma$ are much more frequent than others, using a variable-length code.

Ambiguity

Example: suppose $\Sigma = {A, B, C, D}$. Fixed-length encoding would be ${00, 01, 10, 11}$. Suppose instead we use the encoding ${0, 01, 10, 1}$. What is $001$? - not enough information to answer question.

Prefix-free codes

Problem: with variable-length codes, not clear where one character ends & the next one begins.

Solution: prefix-free codes - make sure that for every pair $i, j \in \Sigma$, neither of the encodings $f(i), f(j)$ is a prefix of the other.

Example: ${0, 10, 110, 111}$

Example

$\Sigma$	Frequencies	Fixed-length	Variable-length (prefix free)
A	60%	$00$	$0$
B	25%	$01$	$10$
C	10%	$10$	$110$
D	5%	$11$	$111$

Fixed-length encoding: 2 bits per character.

Variable-length encoding: how many bits needed per average? Answer: $$ \mathbb{E}[X] = 0.6\times 1 + 0.25\times 2 + 0.1\times 3 + 0.05\times 3 = 1.55 $$

Problem definition

Codes as Trees

Goal: best binary prefix-tree encoding for a given set of character frequencies.

Useful fact: thinking of binary codes as binary trees.

Examples with $\Sigma = {A, B, C, D}$

![[Pasted image 20240224131759.png]]

Prefix-free codes as trees

In general:

left child edges maps to $0$, right child edges maps to $1$.
for each $i \in \Sigma$, exactly one node labeled $i$.
encoding of $i \in \Sigma$, bits along path from root to the node $i$.
prefix-tree: labelled nodes = the leaves (since prefixes = one node an ancestor of another).

To decode: repeatedly follow path from root until you hit a leaf (ex: $0110111 \mapsto ACD$).

Note: encoding length of $i \in \Sigma$ is the depth of $i$ in tree.

Definition

Input: probability $p_i$ for each character $i \in \Sigma$

Notation: if $T$ is the tree with leaves (equals to symbols of $\Sigma$) then $$ L(T) = \sum_p_t\cdot[\text{depth of i in T}] $$ Example: if $p_A = 60%, p_B = 25%, p_C = 10%, p_D = 5%$, then $L(T) = 1.55$

Output: a binary tree $T$ minimizing the average encoding length $L(T)$

A greedy algorithm

Problem definition

Input: probability $p_i$ for each character $i \in \Sigma$

Output: binary tree (with leaves are symbols of $\Sigma$ ), minimizing the average encoding length: $$ L(T) = \sum_p_t\cdot[\text{depth of i in T}] $$

Building a tree

Question: what’s a principled approach for building a tree with leaves are symbols of $\Sigma$?

Natural but suboptimal idea: top-down / divide and conquer

Partition $\Sigma$ into $\Sigma_1, \Sigma_2$ each with $\approx 50%$ of total frequency
Recursively compute $T_1$ for $\Sigma_1$, $T_2$ for $\Sigma_2$, return the tree $T$ with two branch $T_1$ and $T_2$

Huffman’s (optimal) idea

Build tree button-up using successive mergers

A Greedy Approach

Question: which pair of symbols is “safe” to merge?

Observation: final encoding length of $i \in \Sigma$ is the number of mergers its subtree endures.

How to recurse?

Suppose: 1st iteration of algorithm merges symbols $a$ and $b$.

Idea: replace the symbols $a, b$ by a new “meta-symbol” $ab$.

Quiz: Suppose in the first iteration of the algorithm, we merge symbols $a$ and $b$, replace them by a new “meta-symbol” $ab$. How should we define the frequency $p_{ab}$ of this meta-symbol? Answer: $p_a + p_b$.

Example: $p_A = 60%, p_B = 25%, p_C = 10%, p_D = 5%$

![[Pasted image 20240224181949.png]]

Given frequencies $p_i$ as input.

If $\lvert\Sigma\lvert= 2$ , return the tree with left = $A$, right = $B$
Let $a, b \in \Sigma$ have the smallest frequencies..
Let $\Sigma’ = \Sigma$ with $a, b$ replaced by the new symbol $ab$
Define $p_{ab} = p_a + p_b$
Recursively compute $T’$ (to the alphabet $\Sigma’$)
Extend $T’$ (with leaves are $\Sigma’$) to a tree $T$ with leaves are $\Sigma$ by splitting leaf $ab$ into two leaves $a$ and $b$
Return $T$

A more complex example

Input and step 1, 2

Input: character weights

$A$	$B$	$C$	$D$	$E$	$F$
3	2	6	8	2	6

Step 1: Merge $B$ and $E$

$A$	$BE$	$C$	$D$	$F$
3	4	6	8	6

Step 2: Merge $A$ and $BE$

$ABE$	$C$	$D$	$F$
7	6	8	6

Step 3: Merge $C$ and $F$

$ABE$	$CF$	$D$
7	12	8

Step 4: Merge $ABE$ and $D$

$ABED$	$CF$
15	12

Step 5: Merge remaining trees

Final output ![[Pasted image 20240224222928.png]]

Corresponding code

Character	Code
$A$	$000$
$B$	$0010$
$C$	$10$
$D$	$01$
$E$	$0011$
$F$	$11$

Correctness proof

Theorem

Theorem (Huffman 52): Huffman’s algorithm computes a binary tree (with leaves are symbols of $\Sigma$) that minimizes the average encoding length

$$ L(T) = \sum_p_t\cdot[\text{depth of i in T}] $$

Proof: by induction on $n = \lvert\Sigma\rvert$ (can assume $n \geq 2$)

Base case: when $n = 2$, algorithm outputs the optima tree (need 1 bit per symbol)

Inductive step: Fix input with $n = \lvert \Sigma\rvert > 2$

By inductive hypothesis: algorithm solves smaller subproblem for $\Sigma’$ optimally,

Inductive step

Let $\Sigma’ = \Sigma$ with $a, b$ (symbols with smallest frequencies) replaced by meta-symbol $ab$. Define $p_{ab} = p_a + p_b$

Recall: exact correspondence between trees for $\Sigma’$ and $X_{ab}$ - trees for $\Sigma$ that have $a, b$ as siblings.

Important: for every such pair $T’$ and $T’$, $L(T) - L(T’)$ is (after cancellation):

$$ L(T) - L(T’) = p_a\cdot [\text{depth of a in T}] + p_b\cdot[\text{depth of b in T}] - p_{ab}\cdot[\text{depth of ab in T’}] $$

However, $p_{ab} = p_a + p_b$, each depth of $a$ in $T$ and depth of $b$ in $T$ is $1$ more than depth of $ab$ in $T’$.

Therefore,

$$ \begin{aligned} L(T) - L(T’) &= p_a(d + 1) + p_b(d + 1) - (p_a + p_b)d\ &= p_a + p_b \end{aligned} $$

Hence, $T$ and $T’$ are independent!

Inductive hypothesis: Huffman’s algorithm computes a tree $\hat{T}’$ that minimizes $L(T’)$ for $\Sigma’$

Upshot: corresponding tree $\hat{T}$ minimizes $L(T)$ for $\Sigma$ over all trees in $X_{ab}$ (i.e. where $a$ and $b$ are siblings).

Key lemma (completes proof of theorem): there is an optimal tree (for $\Sigma$) in $X_{ab}$ (i.e. $a$ and $b$ were “safe” to merge).

Intuition: can make an optimal tree better by pushing $a$ and $b$ as deep as possible (since $a, b$ have smallest frequencies).

Proof of key lemma

By exchange argument. Let $T^*$ be any tree that minimizes $L(T)$ for $\Sigma$.

Let $x, y$ be siblings at the deepest level of $T^*$.

The exchange: obtain $\hat{T}$ from $T^*$ by

swapping labels $a \longleftrightarrow x$, $b \longleftrightarrow y$.

Note: $\hat{T} \in X_{ab}$ (by choice of $x, y$)

To finish: will show that $L(\hat{T}) \leq L(T^*)$ (therefore $\hat{T}$ is also optimal, completes proof).

Reason: $$L(T^) - L(\hat{T}) = (p_x - p_a)[\text{depth of x in }T^{} - \text{depth of a in }T^{}] + (p_y - p_b)[\text{depth of y in }T^{} - \text{depth of b in }T^{*}]$$

Note that

Both $(p_x - p_a)$ and $(p_y - p_b) \geq 0$ since $a, b$ have smallest frequencies.
Depth of $x, y, a, b$ also $\geq 0$ by choice of $x, y$

Therefore $L(T^*) - L(\hat{T}) \geq 0$ - QED!.

Notes on running time

Naïve implementation: $\mathcal{O}(n^2)$ where $n = \lvert\Sigma\rvert$

Speed ups: use a heap! (to perform repeated minimum computations)

Use keys = frequencies
After extracting the two smallest frequency symbols, re-Insert the new meta-symbol (new key = sum of the 2 old ones). Therefore, this is an iterative $\mathcal{O}(n\log{n})$ implementation.

Even faster (non-trivial exercise): sorting + $\mathcal{O}(1)$ additional work:

manage (meta-) symbols using two queues.

Introduction to Dynamic Programming (DP)

Problem statement

Input: a path graph $G = (V, E)$ with non-negative weights on vertices.

Desired output: subset of nonadjacent vertices - an independent set - of maximum total weight.

Next: iterate through our algorithm design principles.

A greedy approach

Greedy: iteratively choose the max-weight vertex not adjacent to any previously chosen vertex.

Question: consider the four-vertex path graph with vertex weights $1, 4, 5$ and $4$. What is the value of the maximum-weight independent set, and what is the value output of this greedy algorithm?

Answer: $8$ and $6$.

A divide-and-conquer approach

Idea: recursively compute the max-weight independent set (IS) of first half, same for the second half, then combine the solutions.

Problem: what if recursive sub-solutions conflict? - not clear how to quickly fix.

Optimal Substructure

Critical step: reason about structure of an optimal solution (in tons of optimal solutions of smaller subproblems)

Motivation: this thought experiment narrows down the set of candidates for the optimal solution; can search through the small set using brute-force search.

Notation: Let $S \subseteq V$ be a max-weight independent set (IS). Let $v_n$ be the last vertex of the path.

A Case Analysis

![[Pasted image 20240225122930.png]]

Case 1: suppose $v_n \notin S$. Let $G’ = G$ with $v_n$ included.

Note: $S$ also an IS of $G’$
Note: $S$ must be a max-weight IS of $G’$ - if $S^*$ was better, it would also be better than $S$ in $G$ (contradiction).

Case 2: suppose $v_n \in S$.

Note: previous vertex $v_{n - 1} \notin S$ (by definition of an IS). Let $G’’ = G$ with $v_{n - 1}, v_n$ included.
Note: $S - {v_n}$ is an IS of $G’’$.
Note: must in fact be a max-weight IS of $G’’$ - if $S^$ is better than $S$ in $G’’$, then $S^ \cup {v_n}$ is better than $S$ in $G$ (contradiction).

Toward an Algorithm

Upshot: a max-weight IS must be either

a max-weight IS of $G’$ or
$v_n$ and a max-weight IS of $G’’$

Corollary: if we knew whether or not $v_n$ was in the max-weight IS, call recursively compute the max-weight IS of $G’$ or $G’’$ and be done.

(Crazy?) idea: try both possibilities and return the better solution.

A linear-time algorithm

Upshot: if we knew whether or not $v_n$ is in the max-weight IS, then could recursively compute the max-weight IS of $G’$ or $G’’$ and be done.

Proposed algorithm:

Recursively compute $S_1$ be the max-weight IS of $G’$
Recursively compute $S_2$ be the max-weight IS of $G’$
Return $S_1$ or $S_2 \cup {v_n}$ , whichever is better

Good news: correct (formally proof by induction)

Bad news: exponential time

The $64000 question

Important question: how many distinct subproblems ever got solved by this algorithm? Answer: $\Theta(n)$ - only 1 for each “prefix” of the graph! (recursion on only plucks vertices off from the right)

Eliminating Redundancy

Obvious fix: the first time you solve a subproblem, cache its solution in a global take for $\mathcal{O}(1)$-time lookup later on (“memoization”)

Even better: reformulate as a bottom-up iterative algorithm.

Let $G_i$ be the first $i$ vertices of $G$.

Plan: populate array $A$ left to right with $A[i]$ be the value of max-weight IS of $G_i$. Initialization: $A[0] = 0, A[1]=w_1$.

Main loop:

for i = 2, 3, 4, .., n
	A[i] = max(A[i - 1], A[i - 2] + w_i)

Running time: obviously $\mathcal{O}(n)$

Correctness: same as recursive version.

A reconstruction algorithm

Optimal value vs. Optimal solution

Note: algorithm computes the value of a max-weight IS, not such an IS itself.

Correct but not ideal: store optimal IS of each $G_i$ in the array in addition to its value

Better: trace-back through filled-in array to reconstruct optimal solution

Key point: we know that a vertex $v_i$ belongs to a max-weight IS of $G_i$ if and only if $w_i$ + max-weight IS of $G_{i - 2} \geq$ max-weight IS of $G_{i - 1}$ - follows from correctness of our algorithm.

A reconstruction algorithm

Let A = filled-in array A
Let S = empty
While i >= 1 (scan through array from right to left)
	if A[i - 1] >= A[i - 2] + w_i
		decrease i by 1
	else
		add v_i to S
		decrease i by 2
return S

Claim: (by induction or case analysis) final outputs is a max-weight IS of $G$.

Running time: $\mathcal{O}(n)$

Principles of Dynamic Programming

Fact: our WIS algorithm is a dynamic programming algorithm!

Key ingredients of dynamic programming

identify a small number of subproblems (e.g. compute the max-weight IS of $G_i$ for $i = 0, 1, 2, 3, .., n$)
can quickly and correctly solve “larger” subproblems given the solutions to “smaller subproblems” (usually via a recurrence such as $A[i] = \max{A[i - 1], A[i - 2] + w_i}$)
after solving all subproblems, can quickly compute the final solution (usually it’s just the answer to the “biggest” solution).

Quiz

Question 1

Consider an alphabet with five letters ${a, b, c, d, e}$, and suppose we know the frequencies $f_a = 0.32, f_b = 0.25, f_c = 0.2, f_d = 0.18$ and $f_e = 0.05$. What is the expected number of bits used by Huffman’s coding scheme to encode a 1000-letter document?

[p] $2230$
[c] $2400$
[c] $3000$
[c] $3450$

Question 2

Under a Huffman encoding of $n$ symbols, how long (in terms of number of bits) can a codeword possibly be?

[c] $n$
[c] $\log_2{n}$
[c] $\ln{n}$
[p] $n - 1$

Question 3

Which of the following statements holds for Huffman’s coding scheme?

[p] If the most frequent letter has frequency less than $0.33$, than all letters will be coded with at least two bits
[c] A letter with frequency at least $0.5$ might get encoded with two or more bits
[c] If a letter’s frequency is at least $0.4$, then the letter will certainly be coded with only one bit.
[c] If the most frequent letter has frequency less than $0.5$, then all letters will be coded with more than one bit.

Question 4

Which of the following is true for our dynamic programming algorithm for computing a maximum-weight independent set of a path graph? (Assuming there are no ties)

[c] The algorithm always selects the maximum-weight vertex
[c] As long as the input graph has at least two vertices, the algorithm never selects the minimum-weight vertex.
[p] If a vertex is excluded from the optimal solution of two consecutive subproblems, then it is excluded from the optimal solutions of all bigger subproblems
[c] If a vertex is excluded from the optimal solution of a subproblem, then it is excluded from the optimal solutions of all bigger subproblems

Question 5

Recall our dynamic programming algorithm for computing the maximum-weight independent set of a path graph. Consider the following proposed extension to more general graphs. Consider an undirected graph with positive vertex weights. For a vertex $v$, obtain the graph $G’(v)$ by deleting $v$ and its incident edges from $G$, and obtain the graph $G’’(v)$ by deleting $v$, its neighbors, and all of the corresponding incident edges from $G$. Let $\text{OPT}(H)$ denote the value of a maximum-weight independent set of a graph $H$. Consider the formula

$$ \text{OPT}(G)=\max{\text{OPT}(G’(v)), w_v + \text{OPT}(G’’(v))} $$

where $v$ is an arbitrary vertex of $G$ of weight $w_v$. Which of the following statements is true?

[c] The formula is always correct in trees, but does not lead to an efficient dynamic programming algorithm.
[c] The formula is always correct in general graphs, and it leads to an efficient dynamic programming algorithm.
[c] The formula is correct in path graphs but is not always correct in trees.
[p] The formula is always correct in trees, and it leads to an efficient dynamic programming algorithm.