Input: given an array $A$ with $n$ distinct numbers and a number $i \in \{1, 2, \ldots, n\}$.
Output: $i^{th}$ order statistic (i.e. the $i^{th}$ smallest element of the array).
Example: the median ($i = \frac{n + 1}{2}$ if $n$ is odd, $i = \frac{n}{2}$ if $n$ is even).
$\mathcal{O}(n\log(n))$ algorithm: reduce to sorting - sort $A$ (e.g. with Merge Sort), then return the $i^{th}$ element.
Fact: can’t sort an array faster than $\Theta(n\log n)$ (see the lower bound below).
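The sorting reduction fits in a couple of lines of Python (a sketch; `select_by_sorting` is an illustrative name, not from the lecture):

```python
def select_by_sorting(A, i):
    """Return the i-th order statistic (1-indexed) by sorting first: O(n log n)."""
    return sorted(A)[i - 1]

# e.g. the 2nd smallest element of [10, 8, 2, 4] is 4
```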
Next: $\mathcal{O}(n)$ time (randomized) by modifying Quick Sort.
Key idea: partitioning array around a pivot element.
Note: puts pivot in its “rightful position”.
Question: Suppose we are looking for the 5th order statistic in an input array of length 10. We partition the array, and the pivot winds up in the third position of the partitioned array. On which side of the pivot do we recurse, and what order statistic should we look for?
Answer: recurse on the right side of the pivot, looking for the 2nd order statistic (since $i - j = 5 - 3 = 2$).
RSelect(array A, length n, order statistic i)
(0) if n = 1 return A[1]
(1) Choose pivot p from A uniformly at random
(2) Partition A around p
Let j = new index of p
(3) if j = i return p
(4) if j > i return RSelect(1st part of A, j - 1, i)
(5) if j < i return RSelect(2nd part of A, n - j, i - j)
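The pseudocode above translates directly to Python. A minimal sketch, using new lists instead of the lecture's in-place partitioning, for readability:

```python
import random

def rselect(A, i):
    """Return the i-th order statistic (1-indexed) of a list of distinct numbers.
    RSelect sketch: uniformly random pivot, expected O(n) time."""
    if len(A) == 1:
        return A[0]
    p = random.choice(A)             # (1) choose pivot uniformly at random
    left = [x for x in A if x < p]   # (2) 1st part: elements smaller than p
    right = [x for x in A if x > p]  #     2nd part: elements larger than p
    j = len(left) + 1                # pivot's "rightful position" (1-indexed)
    if j == i:                       # (3)
        return p
    elif j > i:                      # (4) recurse on the 1st part
        return rselect(left, i)
    else:                            # (5) recurse on the 2nd part
        return rselect(right, i - j)
```

The output is always correct; only the running time is random.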
Claim: RSelect is correct (guaranteed to output the $i^{th}$ order statistic). Proof: by induction.
Running time? Depends on the “quality” of the chosen pivots.
Question: what is the running time of RSelect if pivots are always chosen in the worst possible way? Answer: $\Theta(n^2)$.
Example: if the pivot is always the minimum of the current subarray, each recursive call removes only the pivot, so the subarray shrinks by just one element per call.
Hence, $\Omega(n)$ time in each of $\Omega(n)$ recursive calls.
Key: find pivots giving “balanced” split.
Best pivot: the median - but computing the median is the original problem, so this is circular. Still, with the median as pivot the recurrence $T(n)$ could be solved via the master method:
$$ T(n) \leq T\left(\frac{n}{2}\right) + \mathcal{O}(n) \Rightarrow T(n) \in \mathcal{O}(n) $$
Hope: random pivot is “pretty good” “often enough”.
RSelect Theorem: for every input array of size $n$, the average running time of RSelect is $\mathcal{O}(n)$.
Note: RSelect uses $\leq Cn$ operations outside of the recursive call (for some constant $C > 0$) - from partitioning.
Notation: RSelect is in phase $j$ iff the current array size is between $\left(\frac{3}{4}\right)^{j + 1}n$ and $\left(\frac{3}{4}\right)^{j}n$.
Let $X_j$ be the number of recursive calls during phase $j$ (i.e., with array size between $\left(\frac{3}{4}\right)^{j + 1}n$ and $\left(\frac{3}{4}\right)^{j}n$). Since each phase-$j$ call does at most $C\left(\frac{3}{4}\right)^{j}n$ work, the running time of RSelect is $\leq \sum_{\text{phases } j}X_j \cdot C \cdot \left(\frac{3}{4}\right)^{j}n$.
Note: if RSelect chooses a pivot giving a 25-75 split (or better) then current phase ends (new subarray length at most 75% of length).
Recall: probability of 25-75 split or better is 50%.
So, $\mathbb{E}[X_j] \leq$ expected number of times you need to flip a fair coin to get one “heads” (heads $\approx$ good pivot, tails $\approx$ bad pivot).
Let $N$ be the number of coin flips until you get heads (a “geometric random variable”).
Note: $\mathbb{E}[N] = 1 + \frac{1}{2}\cdot\mathbb{E}[N]$ - one flip always happens, and with probability $\frac{1}{2}$ (tails) the process restarts.
Solution: $\mathbb{E}[N] = 2$ (recall $\mathbb{E}[X_j] \leq \mathbb{E}[N]$).
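A quick simulation is consistent with $\mathbb{E}[N] = 2$ (a sketch; the seed and trial count are arbitrary choices):

```python
import random

def flips_until_heads(rng):
    """Count fair-coin flips up to and including the first heads."""
    flips = 1
    while rng.random() < 0.5:  # tails with probability 1/2: flip again
        flips += 1
    return flips

rng = random.Random(42)
trials = 100_000
avg = sum(flips_until_heads(rng) for _ in range(trials)) / trials
# avg lands close to the exact expectation E[N] = 2
```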
$$ \begin{aligned} \mathbb{E}[\text{running time of RSelect}] &\leq \mathbb{E}\left[Cn \sum_{\text{phases } j} \left(\frac{3}{4}\right)^j X_j\right] \\ &= Cn\sum_{\text{phases } j}\left(\frac{3}{4}\right)^j\mathbb{E}[X_j] \quad \text{(linearity of expectation)} \\ &\leq 2Cn\sum_{\text{phases } j}\left(\frac{3}{4}\right)^j \end{aligned} $$
Recall: $\sum_{\text{phases } j}\left(\frac{3}{4}\right)^j$ is a geometric sum, $\leq \frac{1}{1 - \frac{3}{4}} = 4$.
Hence,
$$ \text{Expected running time of RSelect} \leq 8Cn \in \mathcal{O}(n) $$
Recall: “best” pivot - the median.
Goal: find pivot guaranteed to be pretty good.
Key idea: the median of medians.
ChoosePivot(A, n)
Logically break A into n/5 groups of size 5 each.
Sort each group (e.g. using Merge Sort)
Copy n/5 medians (i.e. middle element of each sorted group) into new array C
Recursively compute median of C
Return this as pivot
The DSelect Algorithm’s pseudocode:
DSelect(array A, length n, order statistic i)
Break A into groups of 5, sort each group
C = the n/5 middle elements
p = DSelect(C, n/5, n/10) // recursively compute median of C
Partition A around p; let j = new index of p
if j = i return p
else if j > i return DSelect(first part of A, j - 1, i)
else return DSelect(second part of A, n - j, i - j)
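A Python sketch of DSelect (assumptions: list comprehensions replace the in-place partition, and small arrays are handled by sorting directly):

```python
def dselect(A, i):
    """i-th order statistic (1-indexed) via median of medians; worst-case O(n)."""
    if len(A) <= 5:
        return sorted(A)[i - 1]
    # Break A into groups of 5, sort each group, collect the middle elements
    groups = [sorted(A[k:k + 5]) for k in range(0, len(A), 5)]
    C = [g[(len(g) - 1) // 2] for g in groups]
    p = dselect(C, (len(C) + 1) // 2)  # recursively compute median of C
    left = [x for x in A if x < p]
    right = [x for x in A if x > p]
    j = len(left) + 1                  # pivot's position after partitioning
    if j == i:
        return p
    elif j > i:
        return dselect(left, i)
    else:
        return dselect(right, i - j)
```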
Note that DSelect makes at most 2 recursive calls per invocation. DSelect Theorem: for every input array of length $n$, DSelect runs in $\mathcal{O}(n)$ time.
However, DSelect is not as good as RSelect in practice.
History: from 1973, due to Blum-Floyd-Pratt-Rivest-Tarjan.
What is the asymptotic running time of step 1 of DSelect? $\Theta(n)$.
Explanation: sorting an array of 5 elements takes $\leq 120$ operations (Merge Sort’s $6m(\log_2 m + 1)$ bound with $m = 5$ gives $\leq 6 \cdot 5 \cdot 4 = 120$). Over all $\frac{n}{5}$ groups: $\frac{n}{5} \cdot 120 = 24n \in \mathcal{O}(n)$.
Let $T(n)$ be the maximum running time of DSelect on an input array of length $n$. There is a constant $c \geq 1$ such that
$$ T(n) \leq cn + T\left(\frac{n}{5}\right) + T(?) $$
where $T\left(\frac{n}{5}\right)$ covers the recursive median-of-medians computation and $T(?)$ covers the final recursive call, on an array of as-yet-unknown size.
Key lemma: the 2nd recursive call (on the first or second part of A) is guaranteed to be on an array of size $\leq \frac{7}{10}n$ (roughly).
Upshot: can replace “?” by $\frac{7}{10}n$.
Rough proof:
Let $k = \frac{n}{5}$ be the number of groups, and let $x_i$ be the $i^{th}$ smallest of the $k$ “middle elements” (so the pivot is $x_{k/2}$).
Goal: at least 30% of the input array is smaller than $x_{k/2}$, and at least 30% is bigger.
Reason: about half of the $x_i$ are $\leq x_{k/2}$, and each such group median has two further elements of its (sorted) group below it - that’s $3 \cdot \frac{k}{2} = \frac{3n}{10}$ elements $\leq x_{k/2}$. The symmetric argument handles the other side.
Note: the two subproblems have different sizes, so the master method does not apply.
Strategy: guess and check. Hope there is some constant $a$ (independent of $n$) such that $T(n) \leq an$ for all $n \geq 1$. If true, then $T(n) \in \mathcal{O}(n)$ and the algorithm runs in linear time.
Proof: let $a = 10c$; then $T(n) \leq an$ for all $n \geq 1$ (induction on $n$).
Base case: $T(1) \leq c \leq a$.
Inductive step: for all $n > 1$,
$$ \begin{aligned} T(n) &\leq cn + T\left(\frac{n}{5}\right) + T\left(\frac{7}{10}n\right) \\ &\leq cn + a\left(\frac{n}{5}\right) + a\left(\frac{7}{10}n\right) \\ &= n\left(c + \frac{9}{10}a\right) = an \end{aligned} $$
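The guess-and-check bound can also be sanity-checked numerically (a sketch; the constant value `C = 1.0` and the integer flooring of subproblem sizes are assumptions for illustration):

```python
from functools import lru_cache

C = 1.0  # the constant c from the recurrence; value chosen for illustration

@lru_cache(maxsize=None)
def T(n):
    """Evaluate the recurrence T(n) <= C*n + T(n/5) + T(7n/10), floored."""
    if n <= 1:
        return C
    return C * n + T(n // 5) + T(7 * n // 10)

# guess and check: T(n) <= a*n with a = 10*C at a few sizes
assert all(T(n) <= 10 * C * n for n in (10, 100, 10_000, 10**6))
```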
Next: why sorting can’t be done faster than $n\log n$.
Theorem: every “comparison-based” sorting algorithm has worst-case running time $\Omega(n\log n)$ (assume deterministic, but lower bound extends to randomized).
Comparison-based sort: accesses input array elements only via comparisons aka general-purpose sorting method.
Examples: Merge Sort, Quick Sort, Heap Sort
Non-examples: Bucket Sort (good for data from distributions), Counting Sort (good for small integers), Radix Sort
Proof idea: fix a comparison-based sorting method and an array length $n$; there are $n!$ possible input orderings. Suppose the algorithm always makes $\leq k$ comparisons to correctly sort these $n!$ inputs. Since each execution is determined by the outcomes of its comparisons, the algorithm exhibits $\leq 2^k$ distinct executions across all $n!$ inputs.
By the pigeonhole principle: if $2^k < n!$, the algorithm executes identically on two distinct inputs and must get at least one of them incorrect. Therefore, since the method is correct,
$$ \begin{aligned} 2^k &\geq n! \geq \left(\frac{n}{2}\right)^{\frac{n}{2}} \\ \Rightarrow k &\geq \frac{n}{2}\log_2\left(\frac{n}{2}\right) \in \Omega(n\log n) \end{aligned} $$
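For small $n$ the counting argument can be checked directly (a sketch; `comparison_lower_bound` is an illustrative name):

```python
import math

def comparison_lower_bound(n):
    """Smallest k with 2**k >= n!: any correct comparison sort must make
    at least k comparisons on some length-n input."""
    k = 0
    while 2 ** k < math.factorial(n):
        k += 1
    return k

# e.g. n = 10: 10! = 3,628,800, so at least 22 comparisons are needed,
# comfortably above the cruder bound (n/2) * log2(n/2) = 5 * log2(5)
```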
Ingredients:
Cuts of graphs: a cut of a graph $(V, E)$ is a partition of $V$ into two non-empty sets $A$ and $B$.
The crossing edges of a cut $(A, B)$ are those with one endpoint in $A$ and the other in $B$.
Quiz: how many cuts does a graph with $n$ vertices have? Answer: roughly $2^n$ (each vertex independently joins $A$ or $B$; precisely $2^n - 2$ choices leave both sides non-empty).
Input: an undirected graph $G = (V, E)$
Goal: compute a cut with the fewest crossing edges (a min cut).
Applications:
Let $n$ be the number of vertices and $m$ the number of edges. In most (but not all) applications, $m$ is $\Omega(n)$ and $\mathcal{O}(n^2)$.
In a “sparse graph”, $m$ is $\mathcal{O}(n)$ or close to it.
In a “dense graph”, $m$ is closer to $\Theta(n^2)$.
What to use? Adjacency list or matrix?
Answer: depends on the density and operations needed.
While there are more than 2 vertices
Pick a remaining edge (u, v) uniformly at random
Merge (contract) u and v into a single vertex
Remove self loops
Return cut represented by 2 final vertices.
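A compact Python sketch of the contraction algorithm (assumptions: vertices are numbered `0..n-1`, the graph is connected, and a union-find structure stands in for explicitly merging vertices; `contract` and `karger_min_cut` are illustrative names):

```python
import math
import random

def contract(edges, n, rng):
    """One run of the contraction algorithm; returns the number of
    crossing edges of the cut defined by the two final super-vertices."""
    parent = list(range(n))  # union-find over merged super-vertices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    remaining = n
    while remaining > 2:
        u, v = rng.choice(edges)   # rejecting self-loops below keeps this
        ru, rv = find(u), find(v)  # uniform over remaining (non-loop) edges
        if ru == rv:
            continue               # self-loop: endpoints already merged, skip
        parent[ru] = rv            # merge u's super-vertex into v's
        remaining -= 1
    return sum(1 for u, v in edges if find(u) != find(v))

def karger_min_cut(edges, n, seed=0):
    """Repeat the basic algorithm N = n^2 ln n times, keep the smallest cut."""
    rng = random.Random(seed)
    trials = math.ceil(n * n * math.log(n))
    return min(contract(edges, n, rng) for _ in range(trials))

# two triangles joined by a single bridge edge: the min cut has 1 crossing edge
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
```

Choosing uniformly among all original edges and rejecting self-loops is equivalent to choosing uniformly among the remaining edges.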
Question: what is the probability of success?
Fix a graph $G = (V, E)$ with $n$ vertices, $m$ edges. Fix a minimum cut $(A, B)$. Let $k$ be the number of edges crossing $(A, B)$ - call these edges $F$.
What could go wrong? If an edge of $F$ is ever contracted, a vertex of $A$ gets merged with a vertex of $B$, so the output cannot be $(A, B)$; if only non-$F$ edges are contracted, the output is exactly $(A, B)$.
Thus,
$$ \mathbb{P}[\text{output is } (A, B)] = \mathbb{P}[\text{never contracts an edge of } F] $$
Let $S_i$ be the event that an edge of $F$ is contracted in iteration $i$. Goal: compute $\mathbb{P}[\lnot S_1 \cap \lnot S_2 \cap \cdots \cap \lnot S_{n - 2}]$.
Quiz: what is the probability that an edge crossing the minimum cut $(A, B)$ is chosen in the first iteration (as a function of the number of vertices $n$, the number of edges $m$, and the number $k$ of crossing edges)? Answer: $\frac{k}{m}$.
Key observation: the degree of each vertex is at least $k$.
Reason: each vertex $u$ defines a cut $(\{u\}, V - \{u\})$ with $\deg(u)$ crossing edges; since $(A, B)$ is a min cut, $\deg(u) \geq k$.
Since
$$ \sum_v\deg(v)=2m \geq kn $$
we have
$$ m \geq \frac{kn}{2} $$
Hence,
$$ \mathbb{P}[S_1]=\frac{k}{m} \leq \frac{k}{kn/2} = \frac{2}{n} $$
Recall:
$$ \mathbb{P}[\lnot S_1\cap\lnot S_2]=\mathbb{P}[\lnot S_2 \mid\lnot S_1]\mathbb{P}[\lnot S_1] $$
where
$$ \mathbb{P}[\lnot S_2 \mid\lnot S_1]=1-\frac{k}{\#\text{ remaining edges}} $$
and
$$ \mathbb{P}[\lnot S_1] \geq 1-\frac{2}{n} $$
Notes: all nodes in the contracted graph define cuts in $G$ (with at least $k$ crossing edges). Therefore, all degrees in the contracted graph are at least $k$.
So, the number of remaining edges is $\geq \frac{1}{2}k(n-1)\Rightarrow \mathbb{P}[\lnot S_2\mid\lnot S_1]\geq 1-\frac{2}{n-1}$.
In general:
$$ \begin{aligned} \mathbb{P}[\lnot S_1 \cap \lnot S_2 \cap \cdots \cap\lnot S_{n - 2}]&=\mathbb{P}[\lnot S_1]\,\mathbb{P}[\lnot S_2\mid\lnot S_1]\,\mathbb{P}[\lnot S_3\mid\lnot S_1\cap\lnot S_2] \cdots \mathbb{P}[\lnot S_{n -2}\mid\lnot S_1\cap \cdots \cap\lnot S_{n - 3}] \\ &\geq\left(1-\frac{2}{n}\right)\left(1-\frac{2}{n-1}\right)\cdots\left(1-\frac{2}{n-(n-3)}\right) \\ &= \frac{n-2}{n}\cdot\frac{n-3}{n-1}\cdots\frac{2}{4}\cdot\frac{1}{3} = \frac{2}{n(n-1)}\geq\frac{1}{n^2} \end{aligned} $$
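The telescoping of the product can be verified numerically (a sketch; `success_lower_bound` is an illustrative name):

```python
def success_lower_bound(n):
    """Evaluate the product (1 - 2/n)(1 - 2/(n-1)) ... (1 - 2/3) directly."""
    prod = 1.0
    for t in range(n, 2, -1):  # factors 1 - 2/t for t = n, n-1, ..., 3
        prod *= 1 - 2 / t
    return prod

# the product telescopes to 2 / (n(n-1)) = 1 / C(n, 2)
```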
Problem: low success probability! - recall approx. $2^n$ cuts.
Solution: run the basic algorithm a large number $N$ times, remember the smallest cut found.
How many times needed?
Let $T_i$ be the event that the cut $(A, B)$ is found on the $i^{th}$ try. Since each trial uses fresh random choices, the events $T_i$ are independent.
By independence,
$$ \begin{aligned} \mathbb{P}[\text{all } N \text{ trials fail}]&=\mathbb{P}[\lnot T_1 \cap \lnot T_2\cap \cdots \cap\lnot T_N] \\ &=\prod_{i=1}^N\mathbb{P}[\lnot T_i] \leq \left(1-\frac{1}{n^2}\right)^N \end{aligned} $$
So, using $1 - x \leq e^{-x}$: if we take $N=n^2$, $\mathbb{P}[\text{all trials fail}]\leq (e^{-\frac{1}{n^2}})^{n^2}=\frac{1}{e}$.
If we take $N = n^2\ln(n)$, $\mathbb{P}[\text{all trials fail}]\leq \left(\frac{1}{e}\right)^{\ln(n)}=\frac{1}{n}$.
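A quick numeric check of the $N = n^2\ln(n)$ bound (a sketch; $n = 20$ is an arbitrary choice):

```python
import math

n = 20
N = math.ceil(n * n * math.log(n))  # number of independent trials
fail_bound = (1 - 1 / n**2) ** N    # upper bound on P[all trials fail]
# since 1 - x < e^{-x}, fail_bound < e^{-ln n} = 1/n
assert fail_bound < 1 / n
```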
Running time: polynomial in $n$ and $m$, but slow: $\Omega(n^2m)$ with $N = n^2$ trials. Bigger speedups (to roughly $\mathcal{O}(n^2)$) are possible with more ideas.
Note: a graph can have multiple min cuts. What is the largest number of min cuts that a graph with $n$ vertices can have? Answer: ${n\choose 2}=\frac{n(n-1)}{2}$.
Lower bound: consider the $n$-cycle. Each pair of its $n$ edges defines a distinct minimum cut (with two crossing edges). Therefore, the $n$-cycle has $\geq {n \choose 2}$ min cuts.
Let $(A_1, B_1), (A_2, B_2), \ldots, (A_t, B_t)$ be the min cuts of a graph with $n$ vertices. By the contraction algorithm analysis (without repeated trials):
$$ \mathbb{P}[\text{output}=(A_i, B_i)]\geq\frac{2}{n(n-1)}=\frac{1}{\binom{n}{2}}, \quad \forall i\in\{1, \ldots, t\} $$
Note: the events “output $=(A_i, B_i)$” are disjoint (i.e. only one can happen). Therefore their probabilities sum to at most 1.
Thus:
$$ \frac{t}{n \choose 2} \leq 1 \Rightarrow t \leq {n \choose 2} $$
How many different minimum cuts are there in a tree with $n$ nodes?
Answer: $n - 1$ (each of the $n - 1$ edges defines a min cut with exactly one crossing edge).
Let “output” denote the cut output by Karger’s min cut algorithm on a given connected graph with $n$ vertices, and let $\displaystyle p=\frac{1}{n \choose 2}$. Which of the following statements are true?
Answer:
Let $0.5<\alpha<1$ be some constant. Suppose you are looking for the median element in an array using RANDOMIZED SELECT (as explained in lectures). What is the probability that after the first iteration the size of the subarray in which the element you are looking for lies is $\leq \alpha$ times the size of the original array?
Answer: $2\alpha - 1$ (the recursion lands in a subarray of size $\leq \alpha$ times the original exactly when the pivot’s rank falls in the interval $[(1-\alpha)n, \alpha n]$, which contains a $2\alpha - 1$ fraction of the possible ranks).
Let $0<\alpha<1$ be a constant, independent of $n$. Consider an execution of RSelect in which you always manage to throw out at least a $1−\alpha$ fraction of the remaining elements before you recurse. What is the maximum number of recursive calls you’ll make before terminating?
Answer: $-\frac{\log(n)}{\log(\alpha)} = \log_{1/\alpha}(n)$ (the array shrinks by a factor of at least $\frac{1}{\alpha}$ per call).
The minimum $s-t$ cut problem is the following. The input is an undirected graph, and two distinct vertices of the graph are labelled $s$ and $t$. The goal is to compute the minimum cut (i.e., fewest number of crossing edges) that satisfies the property that s and t are on different sides of the cut. Suppose someone gives you a subroutine for this $s-t$ minimum cut problem via an API. Your job is to solve the original minimum cut problem (the one discussed in the lectures), when all you can do is invoke the given min $s-t$ cut subroutine. (That is, the goal is to reduce the min cut problem to the min $s-t$ cut problem.) Now suppose you are given an instance of the minimum cut problem – that is, you are given an undirected graph (with no specially labelled vertices) and need to compute the minimum cut. What is the minimum number of times that you need to call the given min $s-t$ cut subroutine to guarantee that you’ll find a min cut of the given graph?
Answer: $n - 1$ (fix a vertex $s$; every cut separates $s$ from at least one other vertex, so calling the subroutine with $s$ and each of the other $n - 1$ vertices as $t$, and returning the best cut found, is guaranteed to produce a min cut).