Coursera

The knapsack problem

Problem definition

Input: $n$ items. Each has a value $v_i$ (nonnegative) and a size $w_i$ (nonnegative and integral). Also given: a knapsack capacity $W$ (a nonnegative integer).

Output: a subset $S \subseteq \{1, 2, 3, .., n\}$ that maximizes $\sum_{i\in S}v_i$ subject to $\sum_{i\in S}w_i \leq W$

Developing a DP Algorithm

Step 1: formulate recurrence (optimal solution as function of solutions to “smaller” subproblems) based on structure of an optimal solution

Let $S$ be the max-value solution to an instance of Knapsack.

Case 1: suppose item $n\notin S$. Then $S$ must be optimal with respect to the first $n - 1$ items (and the same capacity $W$). If some $S^*$ were better than $S$ with respect to the first $n - 1$ items, then it would be equally better with all $n$ of the items, contradicting the purported optimality of $S$.

Case 2: suppose item $n \in S$. Then $S - \{n\}$ is an optimal solution with respect to the first $n - 1$ items and capacity $W - w_n$.

Proof: if $S^*$ has higher value than $S - \{n\}$ and total size $\leq W - w_n$, then $S^*\cup\{n\}$ has size $\leq W$ and value more than $S$ (contradiction).

Notation: Let $V_{i, x}$ be the value of the best solution that

  1. uses only the first $i$ items
  2. has total size $\leq x$.

Upshot of last video: for $i\in \{1, 2, .., n\}$ and any $x$, $$ V_{i, x}=\max\{V_{(i - 1), x},\ v_i + V_{(i - 1), x - w_i}\} $$ Edge case: if $w_i > x$, must have $V_{i, x} = V_{(i - 1), x}$

Step 2: identify the subproblems

Step 3: use recurrence from Step 1 to systematically solve all subproblems

Let A be a 2-D array
Initialize A[0, x] = 0 for x = 0, 1, 2, .., W
for i = 1, 2, 3, .., n:
	for x = 0, 1, 2, 3, .., W:
		if w_i > x:
			A[i, x] = A[i - 1, x]   (case 2 is infeasible)
		else:
			A[i, x] = max(A[i - 1, x], A[i - 1, x - w_i] + v_i)
return A[n, W]

Previous results are always available for $\mathcal{O}(1)$-time lookup.

What is the running time of this algorithm? $\Theta(nW)$: there are $\Theta(nW)$ subproblems, each solved in $\Theta(1)$ time.

Correctness: straightforward induction (use step 1 argument to justify inductive step).
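The pseudocode above translates almost line-for-line into Python; here is a minimal sketch (the function name and the small test instance are my own):

```python
def knapsack(values, weights, W):
    """Value of a max-value subset of items with total weight <= W.

    values[i] and weights[i] describe item i (0-indexed here);
    W is a nonnegative integer capacity.
    """
    n = len(values)
    # A[i][x] = best value using only the first i items with capacity x.
    A = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for x in range(W + 1):
            A[i][x] = A[i - 1][x]          # case 1: item i excluded
            if weights[i - 1] <= x:        # case 2 feasible only if item fits
                A[i][x] = max(A[i][x],
                              A[i - 1][x - weights[i - 1]] + values[i - 1])
    return A[n][W]
```

For example, with values (3, 2, 4, 4), weights (4, 3, 2, 3), and $W = 6$, the best feasible subset takes the last two items for a total value of 8.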

Sequence Alignment

Problem definition

Recall: sequence alignment (Needleman-Wunsch score = similarity measure between strings)

Example: AGGGCT - AGG-CA

Total penalty = $\alpha_{gap} + \alpha_{AT}$

Input: strings $X = x_1, x_2, .., x_m$, $Y = y_1, y_2, .., y_n$ over some alphabet $\Sigma$ (for example $\{A, C, G, T\}$)

Feasible solutions: alignments - i.e. insert gaps to equalize lengths of the strings.

A dynamic programming approach

Key step: identify subproblems. As usual, we will look at the structure of an optimal solution for clues (i.e., develop a recurrence and then reverse-engineer the subproblems).

Structure of optimal solution: consider an optimal alignment of $X$, $Y$ and its final position:

Question: how many relevant possibilities are there for the contents of the final position? Answer: 3

Case 1: $x_m, y_n$ matched Case 2: $x_m$ matched with a gap Case 3: $y_n$ matched with a gap

Point: narrow optimal solution down to 3 candidates.

Optimal substructure: let $X’ = X - x_m, Y’ = Y - y_n$.

Proof (of case 1; other cases are similar): by contradiction, suppose the induced alignment of $X’, Y’$ has penalty $P$ while some other alignment has penalty $P^* < P$. Then appending $x_m, y_n$ (the contents of the final position) to the latter gives an alignment of $X$ and $Y$ with penalty $P^* + \alpha_{x_my_n} < P + \alpha_{x_my_n}$ = the penalty of the original alignment. This contradicts the optimality of the original alignment of $X$ and $Y$.

Relevant subproblems: have the form $(X_i, Y_j)$, where $X_i$ is the first $i$ letters of $X$ and $Y_j$ is the first $j$ letters of $Y$ (since we only peel letters off the right ends of the strings).

The recurrence

Notation: let $P_{ij}$ be the penalty of optimal alignment of $X_i$ and $Y_j$.

Recurrence: for all $i = 1, 2, 3, .., m$ and $j = 1, 2, 3, .., n$,

$$ P_{ij}= \min \begin{cases} \alpha_{x_iy_j} + P_{(i - 1), (j - 1)}\\ \alpha_{gap} + P_{(i - 1), j}\\ \alpha_{gap} + P_{i, (j - 1)} \end{cases} $$

Correctness: optimal solution is one of these 3 candidates, and recurrence selects the best of 3.

Base cases

Question: what is the value of $P_{i, 0}$ and $P_{0, i}$? Answer: $i\alpha_{gap}$

The algorithm

A = 2-D array
A[i, 0] = A[0, i] = i * alpha[gap] for all i

for i = 1 to m
	for j = 1 to n
		A[i, j] = min(A[i - 1, j - 1] + alpha[x_i][y_j], A[i - 1, j] + alpha[gap], A[i, j - 1] + alpha[gap])
		(all three subproblems available for O(1) lookup)

Correctness: (i.e. $A_{ij} = P_{ij}\ \forall i, j \geq 0$) follows from induction & correctness of recurrence

Running time: $\mathcal{O}(mn)$ - $\Theta(1)$ for each of $\Theta(mn)$ subproblems.
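A Python sketch of this table-filling (the function name is my own, and for simplicity I assume a single gap penalty and a single mismatch penalty, though in general $\alpha_{ab}$ can differ per symbol pair):

```python
def alignment_penalty(X, Y, gap, mismatch):
    """Minimum Needleman-Wunsch penalty for aligning strings X and Y.

    gap is the penalty alpha_gap per inserted gap; mismatch is the
    penalty for matching two unequal symbols (matching equal symbols
    is free).
    """
    m, n = len(X), len(Y)
    # A[i][j] = P_ij, the penalty of an optimal alignment of X[:i], Y[:j].
    A = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        A[i][0] = i * gap              # base case: X[:i] vs. empty string
    for j in range(1, n + 1):
        A[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 0.0 if X[i - 1] == Y[j - 1] else mismatch
            A[i][j] = min(A[i - 1][j - 1] + match,  # case 1: x_i with y_j
                          A[i - 1][j] + gap,        # case 2: x_i with a gap
                          A[i][j - 1] + gap)        # case 3: y_j with a gap
    return A[m][n]
```

On the earlier example, aligning AGGGCT with AGGCA under unit gap and mismatch penalties costs 2 (one gap plus one A-T mismatch).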

Reconstructing a solution

Trace back through the filled-in table $A$, starting at the entry $A_{mn}$. Upon reaching subproblem $A_{ij}$:

  1. if $A_{ij} = A_{(i - 1), (j - 1)} + \alpha_{x_iy_j}$, match $x_i$ with $y_j$ and continue from $A_{(i - 1), (j - 1)}$
  2. else if $A_{ij} = A_{(i - 1), j} + \alpha_{gap}$, match $x_i$ with a gap and continue from $A_{(i - 1), j}$
  3. else match $y_j$ with a gap and continue from $A_{i, (j - 1)}$

If $i = 0$ or $j = 0$, match the remaining substring with gaps

Running time: $\mathcal{O}(m + n)$.
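The trace-back can be sketched in Python as follows (names are my own; as above I assume a single gap and mismatch penalty). It rebuilds the table and then repeatedly undoes whichever case achieved each minimum:

```python
def reconstruct_alignment(X, Y, gap, mismatch):
    """Recover an optimal alignment of X and Y as a pair of
    equal-length strings, with '-' marking gaps."""
    m, n = len(X), len(Y)
    # Forward pass: fill in the table A exactly as before.
    A = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        A[i][0] = i * gap
    for j in range(n + 1):
        A[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 0.0 if X[i - 1] == Y[j - 1] else mismatch
            A[i][j] = min(A[i - 1][j - 1] + match,
                          A[i - 1][j] + gap,
                          A[i][j - 1] + gap)
    # Reverse pass: trace back from A[m][n].
    out_x, out_y = [], []
    i, j = m, n
    while i > 0 and j > 0:
        match = 0.0 if X[i - 1] == Y[j - 1] else mismatch
        if A[i][j] == A[i - 1][j - 1] + match:   # case 1: x_i with y_j
            out_x.append(X[i - 1]); out_y.append(Y[j - 1]); i -= 1; j -= 1
        elif A[i][j] == A[i - 1][j] + gap:       # case 2: x_i with a gap
            out_x.append(X[i - 1]); out_y.append('-'); i -= 1
        else:                                    # case 3: y_j with a gap
            out_x.append('-'); out_y.append(Y[j - 1]); j -= 1
    # i = 0 or j = 0: match the remaining substring with gaps.
    while i > 0:
        out_x.append(X[i - 1]); out_y.append('-'); i -= 1
    while j > 0:
        out_x.append('-'); out_y.append(Y[j - 1]); j -= 1
    return ''.join(reversed(out_x)), ''.join(reversed(out_y))
```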

Optimal BST

Problem definition

Recall: for a given set of keys, there are lots of valid search trees (say $x < y < z$).

Question: what is the “best” search tree for a given set of keys?

A good answer: a balanced search tree, like a red-black tree (worst-case search time = $\Theta(height) = \Theta(\log{n})$).

Question: suppose we have keys $x < y < z$ and we know that $x$ is accessed 80% of the time, and $y$ and $z$ 10% of the time each.

What is the average search time (i.e. the number of nodes looked at) in the trees with root $y$ (left child $x$ and right child $z$) and with root $x$ (right child $y$ and right grandchild $z$), respectively? Answer: 1.9 and 1.3
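A quick sanity check of these numbers in Python, assuming the access frequencies implied by the stated answers (80% for $x$, 10% each for $y$ and $z$):

```python
def avg_search_time(freqs, depths):
    """Weighted search time: sum of p_i * (depth of i + 1),
    where the root has depth 0."""
    return sum(p * (d + 1) for p, d in zip(freqs, depths))

p = [0.8, 0.1, 0.1]           # frequencies of x, y, z (assumed)
balanced = avg_search_time(p, [1, 0, 1])   # root y, children x and z
chain = avg_search_time(p, [0, 1, 2])      # root x, then y, then z
```

Here `balanced` evaluates to 1.9 and `chain` to 1.3: the unbalanced tree wins because the most frequent key sits at the root.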

Input: frequencies $p_1, p_2, .., p_n$ for items $1, 2, 3, .., n$ (assuming items in sorted order i.e. $1 < 2 < 3 < .. < n$).

Goal: compute a valid search tree that minimizes the weighted (average) search time

$$ C(T) = \sum_{\text{items } i}p_i\cdot[\text{search time for $i$ in $T$}] $$ (where the search time for $i$ is the number of nodes looked at, i.e., the depth of $i$ in $T$, plus 1)

(assuming $\sum_i p_i = 1$)

Example: if $T$ is a red-black tree, then $C(T) = \mathcal{O}(\log{n})$.

Comparison with Huffman codes:

Similar: the output is a binary tree, and the goal is (essentially) to minimize the average depth with respect to the given probabilities.

Difference: with Huffman codes the constraint was prefix-freeness (symbols appear only at the leaves); here, the constraint is the search tree property.

Intuition: want the most (respectively, least) frequently accessed items closest (respectively, furthest) from the root.

Ideas for greedy algorithms: bottom-up (populate the deepest levels with the least frequently accessed keys, a la Huffman) or top-down (put the most frequently accessed key at the root). Neither is correct in general.

Choosing the root

Issue: with the top-down approach, the choice of root has hard-to-predict repercussions further down the tree (stymies both greedy and naïve divide and conquer approaches)

Idea: what if we knew the root? (i.e., maybe we can try all possibilities within a dynamic programming algorithm!)

Question: suppose an optimal BST for keys $\{1, 2, .., n\}$ has root $r$, left subtree $T_1$, right subtree $T_2$. Pick the strongest statement that you suspect is true. Answer: $T_1$ is optimal for the keys $\{1, 2, .., r-1\}$ and $T_2$ is optimal for the keys $\{r + 1, r + 2, .., n\}$.

Proof of optimal substructure

Let $T$ be an optimal BST for keys $\{1, 2, .., n\}$ with frequencies $p_1, p_2, .., p_n$, and suppose $T$ has root $r$. Suppose for contradiction that $T_1$ is not optimal for $\{1, 2, .., r - 1\}$, i.e., there is a BST $T^*_1$ with $C(T^*_1) < C(T_1)$ (the other case is similar).

Obtain $T^*$ from $T$ by “cutting + pasting” $T^*_1$ in for $T_1$. Note: to complete the contradiction and the proof, we only need to show that $C(T^*) < C(T)$

A calculation:

$$ \begin{aligned} C(T) &= \sum_{i = 1}^n p_i\cdot [\text{search time for $i$ in $T$}]\\ &= p_r\cdot 1 + \sum_{i = 1}^{r - 1} p_i\cdot [\text{search time for $i$ in $T_1$} + 1]+\sum_{i = r + 1}^{n} p_i\cdot [\text{search time for $i$ in $T_2$} + 1]\\ &= \sum_{i = 1}^n p_i + \sum_{i = 1}^{r - 1} p_i\cdot [\text{search time for $i$ in $T_1$}]+\sum_{i = r + 1}^{n} p_i\cdot [\text{search time for $i$ in $T_2$}]\\ &= \sum_{i = 1}^n p_i + C(T_1) + C(T_2) \end{aligned} $$

Similarly:

$$ C(T^*) = \sum_{i = 1}^n p_i + C(T^*_1) + C(T_2) $$

Upshot: $C(T^*_1) < C(T_1)$ implies $C(T^*) < C(T)$, contradicting the optimality of $T$.

A Dynamic Programming Algorithm

Optimal substructure lemma: If $T$ is an optimal BST for the keys $\{1, 2, .., n\}$ with root $r$, then its subtrees $T_1$ and $T_2$ are optimal BSTs for the keys $\{1, 2, .., r - 1\}$ and $\{r + 1, .., n\}$, respectively.

Note: items in a subproblem are either a prefix or a suffix of the original problem.

Question: let $\{1, 2, .., n\}$ be the original items. For which subsets $S \subseteq \{1, 2, .., n\}$ might we need to compute the optimal binary search tree for $S$? Answer: contiguous intervals ($S = \{i, i + 1, .., j - 1, j\}$ for every $i \leq j$)

The recurrence

Notation: for $1\leq i \leq j \leq n$, let $C_{ij}$ be the weighted search cost of an optimal BST for the items $\{i, i + 1, .., j - 1, j\}$ (with probabilities $p_i, p_{i + 1}, .., p_j$)

Recurrence: for every $1 \leq i \leq j \leq n$:

$$ C_{ij} = \min_{r = i}^j\left\{\sum_{k = i}^jp_k + C_{i, r - 1} + C_{r + 1, j}\right\} $$

The algorithm

Important: solve smallest subproblems (with fewest number $j - i + 1$ of items) first.

Let A = 2-D array
for s = 0 to n - 1:   (s represents the subproblem size j - i)
	for i = 1 to n - s:
		A[i, i + s] = min over r in {i, .., i + s} of (p_i + .. + p_{i + s}) + A[i, r - 1] + A[r + 1, i + s]
		              (interpret A[i, r - 1] or A[r + 1, i + s] as 0 if 1st index > 2nd index)

return A[1, n]
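The pseudocode can be fleshed out in Python (0-indexed; the function name and test frequencies are my own):

```python
def optimal_bst_cost(p):
    """Weighted search cost of an optimal BST for keys 0..n-1 (in
    sorted order) with frequencies p[0..n-1], solving subproblems
    in order of increasing size."""
    n = len(p)
    # C[i][j] = cost of an optimal BST for keys i..j; empty ranges cost 0.
    C = [[0.0] * n for _ in range(n)]
    for s in range(n):                  # s = j - i, subproblem size minus 1
        for i in range(n - s):
            j = i + s
            weight = sum(p[i:j + 1])    # sum of p_k for k = i..j
            best = float('inf')
            for r in range(i, j + 1):   # try every key as the root
                left = C[i][r - 1] if r > i else 0.0
                right = C[r + 1][j] if r < j else 0.0
                best = min(best, weight + left + right)
            C[i][j] = best
    return C[0][n - 1] if n else 0.0
```

On the earlier three-key example with frequencies 0.8, 0.1, 0.1, this returns 1.3, the cost of the chain rooted at the most frequent key.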

Running time

$\Theta(n^2)$ subproblems, each solved in $\mathcal{O}(n)$ time (trying every candidate root $r$). Hence, $\Theta(n^3)$ time overall.

Fun fact: (Knuth 71, Yao 80) optimized version of this DP algorithm correctly fills up entire table in only $\Theta(n^2)$ time ($\Theta(1)$ on average per subproblem).

Idea: piggyback on work done in previous subproblems to avoid trying all possible roots.

Quiz

Question 1

Consider a variation of the Knapsack problem where we have two knapsacks, with integer capacities $W_1$ and $W_2$. As usual, we are given $n$ items with positive values and positive integer weights. We want to pick subsets $S_1, S_2$ with maximum total value (i.e., $\sum_{i \in S_1}v_i + \sum_{i \in S_2}v_i$) such that the total weights of $S_1$ and $S_2$ are at most $W_1$ and $W_2$, respectively. Assume that every item fits in either knapsack (i.e., $w_i\leq \min\{W_1, W_2\}$ for every item $i$). Consider the following two algorithmic approaches. (1) Use the algorithm from lecture to pick a max-value feasible solution $S_1$ for the first knapsack, and then run it again on the remaining items to pick a max-value feasible solution $S_2$ for the second knapsack. (2) Use the algorithm from lecture to pick a max-value feasible solution for a knapsack with capacity $W_1 + W_2$, and then split the chosen items into two sets $S_1, S_2$ that have size at most $W_1$ and $W_2$, respectively. Which of the following statements is true?

Question 2

Recall the dynamic programming algorithms from lecture for the Knapsack and sequence alignment problems. Both fill in a two-dimensional table using a double-for loop. Suppose we reverse the order of the two for loops. (I.e., cut and paste the second for loop in front of the first for loop, without otherwise changing the text in any way.) Are the resulting algorithms still well defined and correct?

Question 3

Consider an instance of the optimal binary search tree problem with 7 keys (say 1, 2, 3, 4, 5, 6, 7 in sorted order) and frequencies

| $i$ | $p_i$ |
| --- | ----- |
| 1   | 0.05  |
| 2   | 0.4   |
| 3   | 0.08  |
| 4   | 0.04  |
| 5   | 0.1   |
| 6   | 0.1   |
| 7   | 0.23  |

What is the minimum-possible average search time of a binary search tree with these keys?

Question 4

The following problems all take as input two strings $X$ and $Y$, of length $m$ and $n$, over some alphabet $\Sigma$. Which of them can be solved in $\mathcal{O}(mn)$ time? (Check all that apply.)

Question 5

Recall our dynamic programming algorithms for maximum-weight independent set, sequence alignment, and optimal binary search trees. The space requirements of these algorithms are proportional to the number of subproblems that get solved: $\Theta(n)$ (where $n$ is the number of vertices), $\Theta(mn)$ (where $m,n$ are the lengths of the two strings), and $\Theta(n^2)$ (where $n$ is the number of keys), respectively.

Suppose we only want to compute the value of an optimal solution (the final answer of the first, forward pass) and don’t care about actually reconstructing an optimal solution (i.e., we skip the second, reverse pass over the table). How much space do you then really need to run each of the three algorithms?