Coursera

Kruskal’s MST Algorithm

MST review

Input: undirected graph $G = (V, E)$, edge costs $c_e$

Output: min-cost spanning tree (no cycles, connected)

Assumptions: $G$ is connected; distinct edge costs

Cut property: if $e$ is the cheapest edge crossing some cut $(A, B)$, then $e$ belongs to the MST.

Kruskal’s MST algorithm

1. sort edges in order of increasing cost (rename edges 1, 2, 3, .., n so that c_1 < c_2 < .. < c_n)
2. T = empty
3. for i = 1 to m
	if T union {i} has no cycles
		add i to T
4. return T

Correctness of Kruskal’s algorithm

Theorem: Kruskal’s algorithm is correct.

Proof: let $T^*$ is the output of Kruskal’s algorithm on input graph $G$.

Clearly $T^*$ has no cycles.
$T^*$ is connected. Why?
1. By Empty Cut Lemma (revise [[[courses/algorithms/algorithms-greedy/week1/notes|notes]]), only need to show that $T^*$ crosses every cut.
2. Fix a cut $(A, B)$. Since $G$ is connected, at least one of its edges crosses $(A, B)$. Key point: Kruskal will include first edge crossing $(A, B)$ that it sees (by Lonely Cut Corollary, cannot create a cycle)
Every edge of $T^$ justified by the cut property (implies $T^$ is the MST) Reason: Consider iteration where edge cut $(u, v)$ added to current set $T$. Since $T \cup {(u, v)}$ has no cycle, $T$ has no $u-v$ path (as in proof of Empty Cut Lemma). Therefore, exists an empty cut $(A, B)$ separating $u$ and $v$. Hence, by 2b no edge crossing $(A, B)$ were previously considered by Kruskal’s algorithm, and $(u, v)$ is the first (hence the deepest) edge crossing $(A, B)$, $(u, v)$ justified by the Cut Property.

Implementing Kruskal’s Algorithm via Union-Find

Running time of straightforward implementation

$\mathcal{O}(m\log{n})$ for sorting all edges (where $m$ is number of edges, $n$ is number of vertices) Recall that $m \in \mathcal{O}(n^2)$ assuming no parallel edges.
$\mathcal{O}(m)$ iterations in the main loop.
$\mathcal{O}(n)$ time to check for cycle – use BFS or DFS in the graph $(V, T)$ contains $\leq (n - 1)$ edges. Therefore, overall running time

$$ \mathcal{O}(m\log{n}) + \mathcal{O}(mn) = \mathcal{O}(mn) $$ Can we speed that up?

Plan: data structure for $\mathcal{O}(1)$ cycle check, which makes total $\mathcal{O}(m\log{n})$ time.

Union-Find data structure

Raison d’être of a union-find data structure: maintain partition of a set of objects.

Operations:

$Find(x)$: given an object $x$, return the name of the group that $x$ belongs to.
$Union(c_i, c_j)$: fuse groups $c_i$ and $c_j$ into a single one.

Why useful for Kruskal’s algorithm?

Objects: vertices
Groups: connected components which respects to currently chosen edges $T$
Adding new edge $(u, v)$ to $T$ also means fusing connected components of $u, v$

Union-Find basics

Idea number 1

maintain one linked structure per connected component of $(V, T)$
each component has an arbitrary leader vertex.

Invariant: each vertex points to the leader vertex of its component (“name” of a component inherited from leader vertex).

Key point: given edge $(u, v)$, can check if $u$ and $v$ already in same component in $\mathcal{O}(1)$ time (if and only if leader pointers of $u$ and $v$ match i.e. $Find(u) = Find(v)$).

Hence, $\mathcal{O}(1)$ time cycle checks.

Maintaining the invariant

Note: when new edge $(u, v)$ added to $T$, connected components of $u$ and $v$ merge.

Question: how many leader pointer updates are needed to restore the invariant in the worst case? Answer: $\Theta(n)$ e.g. when merging two components with $\frac{n}{2}$ vertices each.

Idea number 2

When two components merge, have smaller one inherit the leader of the larger one (easy to maintain a size field in each component to facilitate this).

Question: how many leader pointer updates are now needed to restore the invariant in the worst case? Answer: still $\Theta(n)$, same reason with the previous [[#Maintaining the invariant]] quiz.

Updating leader pointers

But: how many times does a single vertex have its leader pointer update over the course of Kruskal’s algorithm? Answer: $\Theta(\log{n})$

Reason: every time $v$’s leader pointer gets updated, population of its component at least doubles. This can only happen $\leq \log_2{n}$ times.

Running time of fast implementation

Scorecard:

$\mathcal{O}(m\log{n})$ time for sorting
$\mathcal{O}(m)$ time for cycle checks ($\mathcal{O}(1)$ per iteration)
$\mathcal{O}(n\log{n})$ time overall for leader pointer updates

Therefore: $\mathcal{O}(m\log{n})$ total.

SOTA and Open Questions

Question

Biggest question: can we do better than $\mathcal{O}(m\log{n})$ (running time for Prim + Kruskal)?

Answer: yes.

$\mathcal{O}(m)$ randomized algorithm (Karger-Klein-Tarjan JACM 1995)
$\mathcal{O}(m\alpha(n))$ deterministic (Chazelle JACM 2000) where $\alpha(n)$ is the “inverse Ackermann funciton”. In particular, grows much slower than $\log^*{n}:=$ number of times you can apply $\log$ to $n$ until result drops below 1 (inverse of “tower function” $2^{2^{2^{2^{..}}}}$)

Open questions

Weirdest of all (Pettie/Ramachandran JACM 2002): optimal deterministic MST algorithm, but precise asymptotic running time is unknown! (between $\Theta(m)$ and $\Theta(m\alpha(n))$, but don’t know where).

Yet, there are open questions:

Simple randomized $\mathcal{O}(m)$ time algorithm for MST. (sufficient: do this just for the “MST verification” problem).
Is there a deterministic $\mathcal{O}(m)$ time algorithm?

MST application to Clustering

Clustering

Informal goal: given $n$ points, classify into “coherent groups”.

Assumptions:

as input, given a (dis)similarity measure - a distance $d(p, q)$ between each point pair.
symmetric (i.e. $d(p, q) = d(q, p)$)

Examples: Euclidean distance, genome similarity, etc.

Goal: same cluster = nearby.

Max-spacing k-clustering

Assume: we know $k$ is number of clusters desired (in practice, can experiment with a range of values).

Call points $p$ and $q$ separated if they’re assigned to different clusters.

Definition: the spacing of a $k$-clustering is

$$\min_{\text{separated } p, q} d(p, q)$$ (the bigger, the better)

Problem statement: given a distance measure $d$ and $k$, compute the $k$-clustering with maximum spacing.

A Greedy Algorithm

- initially, each point in a separate cluster
- repeat until only k clusters:
	- let p, q be the closest pair of separated points (determines the current spacing)
	- merge the clusters containing p and q into a single cluster

Note: just like Kruskal’s MST algorithm, but stopped early.

points = vertices
distances = edge costs
point pairs = edges

This called single-link clustering

Correctness

Claim

Theorem: single-link clustering finds the max-spacing $k$-clustering

Proof: let $C_1, C_2, .., C_k$ be the greedy clustering with spacing $S$. Let $\hat{C_1}, \hat{C_2}, .., \hat{C_k}$ be the arbitrary other clustering. We need to show the spacing of $\hat{C_1}, \hat{C_2}, .., \hat{C_k} \leq S$

Case 1: $\hat{C_i}$’s are the same as the $C_i$’s (maybe after renaming), hence this has the same spacing $S$.
Case 2: otherwise, can find a point pair $p, q$ such that
- $p, q$ in the same greedy cluster $C_i$
- $p, q$ in different clusters $\hat{C_i},\hat{C_j}$ Property of greedy algorithm: if two points $x, y$ “directly merged” at some point, then $d(x, y) \leq S$ (distance between point pairs only goes up).

[!tips]- Easy case: $p, q$ directly merged at some point if $p, q$ directly merged at some point, $S \geq d(p, q) \geq$ spacing of $\hat{C_1}, .. \hat{C_k}$

[!tips]- Tricky case: $p, q$ “indirectly merged” through multiple direct merges. Let $p, a_1, .., a_l, q$ be the path of direct greedy merges connecting $p$ and $q$. Key point: Since $p \in \hat{C_i}$ and $q \notin \hat{C_i}$, therefore exists consecutive pair $a_j, a_{j + 1}$ with $a_j \in \hat{C_i}, a_{j + 1}\notin \hat{C_i}$ such that $S \geq d(a_j, a_{j + 1}) \geq$ spacing of $\hat{C_1}, .., \hat{C_k}$

Advanced Union-Find

TODO

Quiz

Question 1

Suppose we are given a directed graph $G=(V,E)$ in which every edge has a distinct positive edge weight. A directed graph is acyclic if it has no directed cycle. Suppose that we want to compute the maximum-weight acyclic subgraph of $G$ (where the weight of a subgraph is the sum of its edges’ weights). Assume that $G$ is weakly connected, meaning that there is no cut with no edges crossing it in either direction.

Here is an analog of Prim’s algorithm for directed graphs. Start from an arbitrary vertex $s$, initialize $S={s}$ and $F=\emptyset$. While $S\neq V$, find the maximum-weight edge $(u,v)$ with one endpoint in $S$ and one endpoint in $V−S$. Add this edge to $F$, and add the appropriate endpoint to $S$.

Here is an analog of Kruskal’s algorithm. Sort the edges from highest to lowest weight. Initialize $F=\emptyset$. Scan through the edges; at each iteration, add the current edge $i$ to $F$ if and only if it does not create a directed cycle.

Which of the following is true?

[c] Only the modification of Kruskal’s algorithm always computes a maximum-weight acyclic subgraph.
[p] Both algorithms might fail to compute a maximum-weight acyclic subgraph.
[c] Only the modification of Prim’s algorithm always computes a maximum-weight acyclic subgraph.
[c] Both algorithms always compute a maximum-weight acyclic subgraph.

Question 2

Consider a connected undirected graph $G$ with edge costs that are not necessarily distinct. Suppose we replace each edge cost $c_e$ by $-c_e$; call this new graph $G’$. Consider running either Kruskal’s or Prim’s minimum spanning tree algorithm on $G’$, with ties between edge costs broken arbitrarily, and possibly differently, in each algorithm. Which of the following is true?

[c] Both algorithms compute the same maximum-cost spanning tree of $G$
[c] Prim’s algorithm computes a maximum-cost spanning tree of $G$ but Kruskal’s algorithm might not.
[c] Kruskal’s algorithm computes a maximum-cost spanning tree of $G$ but Prim’s algorithm might not.
[p] Both algorithms compute a maximum-cost spanning tree of $G$, but they might compute different ones.

Question 3

Consider the following algorithm that attempts to compute a minimum spanning tree of a connected undirected graph $G$ with distinct edge costs. First, sort the edges in decreasing cost order (i.e., the opposite of Kruskal’s algorithm). Initialize $T$ to be all edges of $G$. Scan through the edges (in the sorted order), and remove the current edge from $T$ if and only if it lies on a cycle of $T$.

Which of the following statements is true?

[c] The output of the algorithm will never have a cycle, but it might not be connected.
[p] The algorithm always outputs a minimum spanning tree.
[c] The output of the algorithm will always be connected, but it might have cycles.
[c] The algorithm always outputs a spanning tree, but it might not be a minimum cost spanning tree.

Question 4

Consider an undirected graph $G=(V,E)$ where edge $e\in E$ has cost $c_e$. A minimum bottleneck spanning tree $T$ is a spanning tree that minimizes the maximum edge cost $\max_{e\in T}c_e$. Which of the following statements is true? Assume that the edge costs are distinct.

[c] A minimum bottleneck spanning tree is not always a minimum spanning tree and a minimum spanning tree is not always a minimum bottleneck spanning tree.
[c] A minimum bottleneck spanning tree is always a minimum spanning tree and a minimum spanning tree is always a minimum bottleneck spanning tree.
[c] A minimum bottleneck spanning tree is always a minimum spanning tree but a minimum spanning tree is not always a minimum bottleneck spanning tree.
[p] A minimum bottleneck spanning tree is not always a minimum spanning tree, but a minimum spanning tree is always a minimum bottleneck spanning tree.

Question 5

You are given a connected undirected graph $G$ with distinct edge costs, in adjacency list representation. You are also given the edges of a minimum spanning tree $T$ of $G$. This question asks how quickly you can re-compute the MST if we change the cost of a single edge. Which of the following are true? (RECALL: It is not known how to deterministically compute an MST from scratch in $\mathcal{O}(m)$ time, where $m$ is the number of edges of $G$.)

[x] Suppose $e \notin T$ and we increase the cost of $e$. Then, the new MST can be recomputed in $\mathcal{O}(m)$ deterministic time.
[x] Suppose $e\in T$ and we increase the cost of $e$. Then, the new MST can be recomputed in $\mathcal{O}(m)$ deterministic time.
[x] Suppose $e\in T$ and we decrease the cost of $e$. Then, the new MST can be recomputed in $\mathcal{O}(m)$ deterministic time.
[x] Suppose $e \notin T$ and we decrease the cost of $e$. Then, the new MST can be recomputed in $\mathcal{O}(m)$ deterministic time.