Coursera

Week 4

Linear-time Selection

The problem

Input: an array $A$ with $n$ distinct numbers and a number $i \in \{1, 2, \dots, n\}$.

Output: $i^{th}$ order statistic (i.e. the $i^{th}$ smallest element of the array).

Example: the median ($i = \frac{n + 1}{2}$ when $n$ is odd, $i = \frac{n}{2}$ when $n$ is even).

Reduction to Sorting

$\mathcal{O}(n\log(n))$ algorithm:

  1. Apply Merge Sort
  2. Return the $i^{th}$ element of sorted array

Fact: we can’t sort the array faster than $\Omega(n\log n)$ with a comparison-based sort (see the sorting lower bound below).

Next: $\mathcal{O}(n)$ time (randomized) by modifying Quick Sort.

Partitioning around a pivot

Key idea: partitioning array around a pivot element.

Note: partitioning puts the pivot in its “rightful position” (the position it would occupy in the sorted array).

Question: Suppose we are looking for the 5th order statistic in an input array of length 10. We partition the array, and the pivot winds up in the third position of the partitioned array. On which side of the pivot do we recurse, and what order statistic should we look for?

Answer: we recurse on the right side of the pivot and look for the 2nd order statistic there.

Randomized Selection

RSelect(array A, length n, order statistic i)
	(0) if n = 1 return A[1]
	(1) Choose pivot p from A uniformly at random
	(2) Partition  A around p
		Let j = new index of p
	(3) if j = i return p
	(4) if j > i return RSelect(1st part of A, j - 1, i)
	(5) if j < i return RSelect(2nd part of A, n - j, i - j)
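For reference, here is a minimal runnable sketch of RSelect in Python. The in-place Lomuto-style `partition` helper and the 0-indexed bookkeeping are my own choices, not from the lecture; any correct partition subroutine would do.

```python
import random

def partition(A, lo, hi, pivot_idx):
    """Partition A[lo..hi] around A[pivot_idx]; return the pivot's final index."""
    A[pivot_idx], A[hi] = A[hi], A[pivot_idx]   # move pivot to the end
    pivot = A[hi]
    store = lo
    for i in range(lo, hi):
        if A[i] < pivot:
            A[i], A[store] = A[store], A[i]
            store += 1
    A[store], A[hi] = A[hi], A[store]           # pivot into its "rightful position"
    return store

def rselect(A, lo, hi, i):
    """Return the i-th order statistic (1-indexed) of A[lo..hi]."""
    if lo == hi:
        return A[lo]
    pivot_idx = random.randint(lo, hi)          # choose pivot uniformly at random
    j = partition(A, lo, hi, pivot_idx)
    rank = j - lo + 1                           # pivot's order statistic within the subarray
    if rank == i:
        return A[j]
    elif rank > i:
        return rselect(A, lo, j - 1, i)         # recurse on the first part
    else:
        return rselect(A, j + 1, hi, i - rank)  # recurse on the second part

# Example: the 5th smallest of a 10-element array
A = [9, 1, 8, 2, 7, 3, 6, 4, 10, 5]
print(rselect(A, 0, len(A) - 1, 5))  # 5
```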

Claim: RSelect is correct (guaranteed to output the $i^{th}$ order statistic). Proof: by induction.

Running time of RSelect

Running time? It depends on the “quality” of the chosen pivots.

Question: what is the running time of RSelect if pivots are always chosen in the worst possible way? Answer: $\Theta(n^2)$.

Example: if we are looking for, say, the maximum element and the pivot always happens to be the minimum of the current subarray, each recursive call removes only a single element.

Hence, $\Omega(n)$ time in each of $\Omega(n)$ recursive calls.

Key: find pivots giving “balanced” split.

Best pivot: the median - but finding the median is exactly the problem we are trying to solve (circular). Still, if we could partition around the median, the resulting recurrence $T(n)$ could be solved via the master method:

$$ T(n) \leq T\left(\frac{n}{2}\right) + \mathcal{O}(n) \Rightarrow T(n) \in \mathcal{O}(n) $$
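To spell out the master-method step: here $a = 1$ recursive call, the input size is divided by $b = 2$, and the non-recursive work is $\mathcal{O}(n^d)$ with $d = 1$, so we are in the case $a < b^d$:

$$ a = 1 < b^d = 2^1 = 2 \;\Rightarrow\; T(n) \in \mathcal{O}(n^d) = \mathcal{O}(n) $$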

Hope: random pivot is “pretty good” “often enough”.

RSelect Theorem: for every input array of size $n$, the average running time of RSelect is $\mathcal{O}(n)$.

Proof

Proof 1: Tracking Progress via Phases

Note: RSelect uses $\leq cn$ operations outside of the recursive call, for some constant $c > 0$ (this is the cost of partitioning).

Notation: RSelect is in phase $j$ iff the current array size is between $\left(\frac{3}{4}\right)^{j + 1}n$ and $\left(\frac{3}{4}\right)^{j}n$.

Let $X_j$ be the number of recursive calls during phase $j$. Since each phase-$j$ call works on an array of size at most $\left(\frac{3}{4}\right)^{j}n$, the running time of RSelect is $\leq \sum_{\text{phases } j}X_j\cdot c \cdot \left(\frac{3}{4}\right)^{j}n$.

Proof 2: Reduction to Coin Flipping

Recall: $X_j$ is the number of recursive calls during phase $j$, i.e. while the array size is between $\left(\frac{3}{4}\right)^{j + 1}n$ and $\left(\frac{3}{4}\right)^{j}n$.

Note: if RSelect chooses a pivot giving a 25-75 split (or better), then the current phase ends (the new subarray has length at most 75% of the old one).

Recall: probability of 25-75 split or better is 50%.

So, $\mathbb{E}[X_j] \leq$ the expected number of times you need to flip a fair coin to get one “heads” (heads $\approx$ good pivot, tails $\approx$ bad pivot).

Proof 3: Coin Flipping Analysis

Let $N$ be the number of coin flips until you get heads (a “geometric random variable”).

Note: $\mathbb{E}[N] = 1 + \frac{1}{2}\cdot\mathbb{E}[N]$, where the $1$ accounts for the first flip and the $\frac{1}{2}\cdot\mathbb{E}[N]$ term accounts for the case of tails (probability $\frac{1}{2}$), after which the process starts over.

Solution: $\mathbb{E}[N] = 2$ (Recall $\mathbb{E}[X_j] \leq \mathbb{E}[N]$).
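Spelling out the algebra behind that solution:

$$ \mathbb{E}[N] = 1 + \frac{1}{2}\mathbb{E}[N] \;\Rightarrow\; \frac{1}{2}\mathbb{E}[N] = 1 \;\Rightarrow\; \mathbb{E}[N] = 2 $$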

Putting it all together

$$ \begin{aligned} \text{Expected running time of RSelect} &\leq \mathbb{E}\left[cn\sum_{\text{phases } j} \left(\frac{3}{4}\right)^jX_j\right] \\ &=cn\sum_{\text{phases } j}\left(\frac{3}{4}\right)^j\mathbb{E}[X_j] \\ &\leq 2cn\sum_{\text{phases } j}\left(\frac{3}{4}\right)^j \end{aligned} $$

Recall: $\sum_{\text{phases } j}\left(\frac{3}{4}\right)^j$ is a geometric sum, $\leq \frac{1}{1 - \frac{3}{4}} = 4$.

Hence,

$$ \text{Expected running time of RSelect} \leq 8cn \in \mathcal{O}(n) $$

Deterministic Selection

Guaranteeing a Good Pivot

Recall: “best” pivot - the median.

Goal: find pivot guaranteed to be pretty good.

Key idea: median of medians.

A deterministic ChoosePivot

ChoosePivot(A, n)
	Logically break A into n/5 groups of size 5 each.
	Sort each group (e.g. using Merge Sort)
	Copy n/5 medians (i.e. middle element of each sorted group) into new array C
	Recursively compute median of C
	Return this as pivot

The DSelect Algorithm’s pseudocode:

DSelect(array A, length n, order statistic i)
	Break A into groups of 5, sort each group
	C = the n/5 middle elements ("medians")
	p = DSelect(C, n/5, n/10) // recursively compute the median of C
	Partition A around p, let j = new index of p
	if j = i return p
	else if j > i return DSelect(first part of A, j - 1, i)
	else return DSelect(second part of A, n - j, i - j)
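For reference, a minimal runnable Python sketch of DSelect. The three-way list-comprehension partition (not in-place) and the function name are my own choices, not from the lecture; like the lecture, it assumes distinct elements.

```python
def dselect(A, i):
    """Return the i-th order statistic (1-indexed) of list A, deterministically."""
    n = len(A)
    if n <= 5:
        return sorted(A)[i - 1]

    # Break A into groups of 5 and collect the middle element of each sorted group.
    medians = [sorted(A[j:j + 5])[len(A[j:j + 5]) // 2] for j in range(0, n, 5)]

    # Recursively compute the median of the medians; use it as the pivot.
    p = dselect(medians, (len(medians) + 1) // 2)

    # Partition A around p (not in-place, for brevity).
    less = [x for x in A if x < p]
    greater = [x for x in A if x > p]
    j = len(less) + 1                   # pivot's order statistic in A

    if i == j:
        return p
    elif i < j:
        return dselect(less, i)         # recurse on the first part
    else:
        return dselect(greater, i - j)  # recurse on the second part

# Example: the median of a 9-element array
print(dselect([9, 1, 8, 2, 7, 3, 6, 4, 5], 5))  # 5
```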

It is easy to see that DSelect makes at most two recursive calls per invocation. Running time of DSelect (DSelect Theorem): for every input array of length $n$, DSelect runs in $\mathcal{O}(n)$ time.

However, DSelect is not as good as RSelect in practice:

  1. Worse constants
  2. Not in-place

History: from 1973, due to Blum-Floyd-Pratt-Rivest-Tarjan.

Analysis

What is the asymptotic running time of step 1 of DSelect? $\Theta(n)$

Explanation: by Merge Sort’s bound of $\leq 6m(\log_2m+1)$ operations on an array of $m$ elements, sorting a group of $m = 5$ elements takes $\leq 120$ operations. Over all $\frac{n}{5}$ groups this is $\leq \frac{n}{5}\cdot 120 = 24n \in \mathcal{O}(n)$.

Let $T(n)$ be the maximum running time of DSelect on an input array of length $n$. Then there is a constant $c \geq 1$ such that

  1. $T(1)=1$
  2. $T(n) \leq cn+T(\frac{n}{5})+T(?)$

where the “?” is pinned down as follows.

Key lemma: the 2nd recursive call (in line 6 or 7) is guaranteed to be on an array of size $\leq \frac{7}{10}n$ (roughly).

Upshot: can replace “?” by $\frac{7}{10}n$.

Rough proof:

Let $k= \frac{n}{5}$ be the number of groups, and let $x_i$ be the $i^{th}$ smallest of the $k$ “middle elements” (so the pivot is $x_{\frac{k}{2}}$).

Goal: show that at least 30% of the input array is smaller than $x_{\frac{k}{2}}$, and at least 30% is bigger.

  1. $T(1)=1$
  2. $T(n) \leq cn+T(\frac{n}{5})+T(\frac{7}{10}n)$

Note: with different-sized subproblems we can’t use the master method.

Strategy: hope and check. Guess that there is some constant $a$ (independent of $n$) such that $T(n) \leq an\ \forall n \geq 1$. If true, then $T(n) \in \mathcal{O}(n)$ and the algorithm runs in linear time.

Proof: let $a = 10c$; then $T(n) \leq an\ \forall n \geq 1$ (by induction on $n$).

Base case: $T(1) = 1 \leq a$

Inductive step: for all $n > 1$,

$$ \begin{aligned} T(n) &\leq cn + T\left(\frac{n}{5}\right) + T\left(\frac{7}{10}n\right) \\ &\leq cn + a\left(\frac{n}{5}\right) + a\left(\frac{7}{10}n\right) \\ &= n\left(c + \frac{9}{10}a\right)=an \end{aligned} $$

A sorting lower bound

I.e. why we can’t sort faster than $n\log n$ using comparisons.

Theorem: every “comparison-based” sorting algorithm has worst-case running time $\Omega(n\log n)$ (assume deterministic, but lower bound extends to randomized).

Comparison-based sort: accesses input array elements only via comparisons aka general-purpose sorting method.

Examples: Merge Sort, Quick Sort, Heap Sort

Non-examples: Bucket Sort (good for data from distributions), Counting Sort (good for small integers), Radix Sort

Proof idea: fix a comparison-based sorting method and an array length $n$, and consider as inputs the $n!$ possible orderings of $\{1, 2, \dots, n\}$. Suppose the algorithm always makes $\leq k$ comparisons to correctly sort these $n!$ inputs. Then across the $n!$ possible inputs, the algorithm exhibits $\leq 2^k$ distinct executions (one per possible sequence of comparison outcomes).

By the pigeonhole principle:

If $2^k < n!$, then the algorithm executes identically on some two distinct inputs; it then produces the same output for both, and so must get at least one of them wrong. Therefore, since the method is correct,

$$ \begin{aligned} 2^k &\geq n! \\ &\geq \left(\frac{n}{2}\right)^{\frac{n}{2}} \\ \Rightarrow k &\geq \frac{n}{2}\log_2\left(\frac{n}{2}\right) \in \Omega(n\log n) \end{aligned} $$
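The middle step uses the fact that keeping only the largest $\frac{n}{2}$ factors of $n!$, each of which is at least $\frac{n}{2}$, can only decrease the product:

$$ n! \;\geq\; \underbrace{n \cdot (n-1) \cdots \left(\tfrac{n}{2}+1\right)}_{\frac{n}{2}\ \text{factors, each}\ \geq\ \frac{n}{2}} \;\geq\; \left(\frac{n}{2}\right)^{\frac{n}{2}} $$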

Graphs and Minimum Cut

Graphs

Ingredients: vertices $V$ (also called nodes) and edges $E$ - pairs of vertices. Edges can be undirected (unordered pairs) or directed (ordered pairs, a.k.a. arcs).

Cuts of graphs: a cut of a graph $(V, E)$ is a partition of $V$ into two non-empty sets $A$ and $B$.

The crossing edges of a cut $(A, B)$ are those with one endpoint in each of $A$ and $B$ (for an undirected graph), or with tail in $A$ and head in $B$ (for a directed graph).

Quiz: how many cuts does a graph with $n$ vertices have? Answer: roughly $2^n$ (each of the $n$ vertices can be assigned to either group).

The minimum cut problem

Input: an undirected graph $G = (V, E)$

Goal: compute a cut with fewest number of crossing edges (a min cut).

Applications: e.g. identifying weaknesses/bottlenecks in networks, community detection in social networks, image segmentation.

Sparse vs. Dense graphs

What to use? Adjacency list or matrix?

Answer: it depends on the graph’s density and on the operations you need to support. An adjacency matrix takes $\Theta(n^2)$ space regardless of the number of edges; adjacency lists take $\Theta(m + n)$ space, which is much better for sparse graphs.

Random Contraction Algorithm

While there are more than 2 vertices
	Pick a remaining edge (u, v) uniformly at random
	Merge (or “contract”) u and v into a single vertex
	Remove self loops
Return cut represented by 2 final vertices.
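Here is a minimal runnable Python sketch of a single contraction run. The edge-list-plus-union-find representation and the function name `contract_once` are my own choices, not from the lecture; self-loops are removed implicitly by rejecting edges whose endpoints are already merged, and the graph is assumed to be connected.

```python
import random

def contract_once(n, edges):
    """One run of random contraction on a connected graph with vertices 0..n-1.
    edges is a list of (u, v) pairs. Returns the number of edges crossing the
    resulting 2-vertex cut."""
    parent = list(range(n))

    def find(v):                                 # union-find with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    remaining = n
    while remaining > 2:
        u, v = random.choice(edges)              # pick a remaining edge uniformly at random
        ru, rv = find(u), find(v)
        if ru == rv:
            continue                             # self-loop in the contracted graph; skip it
        parent[ru] = rv                          # merge u and v into a single super-vertex
        remaining -= 1

    # Crossing edges: original edges whose endpoints lie in different super-vertices.
    return sum(1 for u, v in edges if find(u) != find(v))

# Example: a square with one diagonal; its minimum cut has 2 crossing edges.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
print(contract_once(4, edges))  # 2 if this run succeeds, otherwise a larger cut
```

A single run may well return a non-minimum cut; the repeated trials analysed below drive the failure probability down.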

Analysis of the Random Contraction Algorithm

Question: what is the probability of success?

Fix a graph $G = (V, E)$ with $n$ vertices, $m$ edges. Fix a minimum cut $(A, B)$. Let $k$ be the number of edges crossing $(A, B)$ - call these edges $F$.

What could go wrong?

  1. Suppose an edge of $F$ is contracted at some point. Then the algorithm will not output $(A, B)$.
  2. Suppose only edges inside $A$ or inside $B$ get contracted. Then the algorithm will output $(A, B)$.

Thus,

$$ \mathbb{P}[\text{output is } (A, B)] = \mathbb{P}[\text{never contracts an edge of } F] $$

Let $S_i$ be the event that an edge of $F$ is contracted in iteration $i$. Goal: compute $\mathbb{P}[\lnot S_1 \cap \lnot S_2 \cap \dots \cap\lnot S_{n - 2}]$.

Quiz: what is the probability that an edge crossing the minimum cut $(A, B)$ is chosen in the first iteration (as a function of the number of vertices $n$, the number of edges $m$, and the number $k$ of crossing edges)? Answer: $\frac{k}{m}$.

The first iteration

Key observation: degree of each vertex is at least $k$.

Reason: each vertex $u$ defines the cut $(\{u\}, V-\{u\})$, which has exactly $\deg(u)$ crossing edges; since $(A, B)$ is a minimum cut, $\deg(u) \geq k$.

Since

$$ \sum_v\deg(v)=2m \geq kn $$

we have

$$ m \geq \frac{kn}{2} $$

Since each of the $m$ edges is equally likely to be contracted first,

$$ \mathbb{P}[S_1]=\frac{k}{m} \leq \frac{k}{kn/2} = \frac{2}{n} $$

The second iteration

Recall:

$$ \mathbb{P}[\lnot S_1\cap\lnot S_2]=\mathbb{P}[\lnot S_2 \mid\lnot S_1]\mathbb{P}[\lnot S_1] $$

where

$$ \mathbb{P}[\lnot S_2 \mid\lnot S_1]=1-\frac{k}{\#\text{ remaining edges}} $$

and

$$ \mathbb{P}[\lnot S_1] \geq 1-\frac{2}{n} $$

Note: every node of the contracted graph defines a cut of the original graph $G$ (with at least $k$ crossing edges). Therefore, all degrees in the contracted graph are at least $k$.

So, the number of remaining edges is $\geq \frac{1}{2}k(n-1)$, hence $\mathbb{P}[\lnot S_2\mid\lnot S_1]\geq 1-\frac{2}{n-1}$.

All iterations

In general:

$$ \begin{aligned} \mathbb{P}[\lnot S_1 \cap \lnot S_2 \cap \dots \cap\lnot S_{n - 2}]&=\mathbb{P}[\lnot S_1]\cdot\mathbb{P}[\lnot S_2\mid\lnot S_1]\cdot\mathbb{P}[\lnot S_3\mid\lnot S_1\cap\lnot S_2] \cdots \mathbb{P}[\lnot S_{n -2}\mid\lnot S_1\cap\dots\cap\lnot S_{n - 3}] \\ &\geq\left(1-\frac{2}{n}\right)\left(1-\frac{2}{n-1}\right)\cdots\left(1-\frac{2}{4}\right)\left(1-\frac{2}{3}\right) \\ &=\frac{n-2}{n}\cdot\frac{n-3}{n-1}\cdots\frac{2}{4}\cdot\frac{1}{3}=\frac{2}{n(n-1)}\geq\frac{1}{n^2} \end{aligned} $$

Problem: low success probability! - recall approx. $2^n$ cuts.

Repeated trials

Solution: run the basic algorithm a large number $N$ of times and remember the smallest cut found.

How many times needed?

Let $T_i$ be the event that the cut $(A, B)$ is found on the $i^{th}$ try. By definition, different $T_i$ are independent.

By independence,

$$ \begin{aligned} \mathbb{P}[\text{all } N \text{ trials fail}]&=\mathbb{P}[\lnot T_1 \cap \lnot T_2\cap \dots \cap\lnot T_N] \\ &=\prod_{i=1}^N\mathbb{P}[\lnot T_i] \leq \left(1-\frac{1}{n^2}\right)^N \end{aligned} $$

So, if we take $N=n^2$, then $\mathbb{P}[\text{all trials fail}]\leq \left(e^{-\frac{1}{n^2}}\right)^{n^2}=\frac{1}{e}$ (using $1 - x \leq e^{-x}$).

If we take $N = n^2\ln(n)$, then $\mathbb{P}[\text{all trials fail}]\leq \left(\frac{1}{e}\right)^{\ln(n)}=\frac{1}{n}$.
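As a usage sketch of the repeated trials, reusing the hypothetical `contract_once` function from the earlier sketch (the wrapper name is mine; the trial count $N = n^2\ln n$ is the one from the analysis above):

```python
import math

def karger_min_cut(n, edges):
    """Run the contraction sketch N = n^2 * ln(n) times; keep the smallest cut found."""
    N = int(n * n * math.log(n)) + 1
    return min(contract_once(n, edges) for _ in range(N))

# Example: the square-with-a-diagonal graph from before has min cut size 2.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
print(karger_min_cut(4, edges))  # 2, with probability >= 1 - 1/n
```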

Running time: polynomial in $n$ and $m$, but slow: $\Omega(n^2m)$ with the repeated trials above. With more ideas, one can get a big speedup, to roughly $\mathcal{O}(n^2)$.

Counting minimum cuts

Note: a graph can have multiple min cuts. What is the largest number of min cuts that a graph with $n$ vertices can have? Answer: ${n\choose 2}=\frac{n(n-1)}{2}$

The lower bound

Consider the $n$-cycle. Each pair of its $n$ edges defines a distinct minimum cut (with two crossing edges). Therefore, the $n$-cycle has $\geq {n \choose 2}$ min cuts.

The upper bound

Let $(A_1, B_1), (A_2, B_2), .., (A_t, B_t)$ be the min cuts of a graph with $n$ vertices. By the contraction algorithm analysis (without repeated trials):

$$ \mathbb{P}[\text{output}=(A_i, B_i)]\geq\frac{2}{n(n-1)}=\frac{1}{\binom{n}{2}} \quad \forall i\in\{1, \dots, t\} $$

Note: the events $\{\text{output} = (A_i, B_i)\}$, $i = 1, \dots, t$, are disjoint (at most one can happen). Therefore their probabilities sum to at most 1.

Thus:

$$ \frac{t}{\binom{n}{2}} \leq 1 \Rightarrow t \leq \binom{n}{2} $$

Quiz

Question 1

How many different minimum cuts are there in a tree with $n$ nodes?

Answer: $n - 1$ (each of the $n - 1$ edges defines a min cut with exactly one crossing edge).

Question 2

Let “output” denote the cut output by Karger’s min cut algorithm on a given connected graph with $n$ vertices, and let $p=\frac{1}{\binom{n}{2}}$. Which of the following statements are true?

Answer:

  1. For every graph $G$ with $n$ nodes and every min cut $(A, B)$ of $G$, $Pr[out=(A, B)] \geq p$
  2. There exists a graph $G$ with $n$ nodes and a min cut $(A, B)$ of $G$ such that $Pr[out=(A, B)] \leq p$
  3. For every graph $G$ with $n$ nodes, there exists a min cut $(A, B)$ of $G$ such that $Pr[out=(A, B)] \geq p$

Question 3

Let $0.5<\alpha<1$ be some constant. Suppose you are looking for the median element in an array using RANDOMIZED SELECT (as explained in lectures). What is the probability that after the first iteration the size of the subarray in which the element you are looking for lies is $\leq \alpha$ times the size of the original array?

Answer: $2\alpha - 1$
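A rough way to see this (ignoring rounding): the recursion lands in a subarray of size $\leq \alpha n$ exactly when the pivot’s rank $r$ is neither among the smallest $(1-\alpha)n$ nor among the largest $(1-\alpha)n$ ranks, i.e.

$$ (1-\alpha)n \;\leq\; r \;\leq\; \alpha n \quad\Rightarrow\quad \mathbb{P} \approx \frac{\alpha n - (1-\alpha)n}{n} = 2\alpha - 1 $$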

Question 4

Let $0<\alpha<1$ be a constant, independent of $n$. Consider an execution of RSelect in which you always manage to throw out at least a $1−\alpha$ fraction of the remaining elements before you recurse. What is the maximum number of recursive calls you’ll make before terminating?

Answer: $-\frac{\log(n)}{\log(\alpha)}$
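A rough way to see this: after $k$ recursive calls the remaining array has size at most $\alpha^k n$, and the recursion ends once that drops to 1:

$$ \alpha^{k} n \leq 1 \;\Leftrightarrow\; k \geq \log_{1/\alpha}(n) = \frac{\log(n)}{\log(1/\alpha)} = -\frac{\log(n)}{\log(\alpha)} $$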

Question 5

The minimum $s-t$ cut problem is the following. The input is an undirected graph, and two distinct vertices of the graph are labelled $s$ and $t$. The goal is to compute the minimum cut (i.e., fewest number of crossing edges) that satisfies the property that s and t are on different sides of the cut. Suppose someone gives you a subroutine for this $s-t$ minimum cut problem via an API. Your job is to solve the original minimum cut problem (the one discussed in the lectures), when all you can do is invoke the given min $s-t$ cut subroutine. (That is, the goal is to reduce the min cut problem to the min $s-t$ cut problem.) Now suppose you are given an instance of the minimum cut problem – that is, you are given an undirected graph (with no specially labelled vertices) and need to compute the minimum cut. What is the minimum number of times that you need to call the given min $s-t$ cut subroutine to guarantee that you’ll find a min cut of the given graph?

Answer: $n - 1$ (fix one vertex $s$; every cut separates $s$ from at least one other vertex $t$, so calling the subroutine with this $s$ and each of the other $n - 1$ vertices as $t$, and returning the best cut found, is guaranteed to find a min cut).