Coursera

Kruskal’s MST Algorithm

MST review

Input: undirected graph $G = (V, E)$, edge costs $c_e$

Output: min-cost spanning tree (no cycles, connected)

Assumptions: $G$ is connected; distinct edge costs

Cut property: if $e$ is the cheapest edge crossing some cut $(A, B)$, then $e$ belongs to the MST.

Kruskal’s MST algorithm

1. sort edges in order of increasing cost (rename edges 1, 2, 3, .., n so that c_1 < c_2 < .. < c_n)
2. T = empty
3. for i = 1 to m
	if T union {i} has no cycles
		add i to T
4. return T

Correctness of Kruskal’s algorithm

Theorem: Kruskal’s algorithm is correct.

Proof: let $T^*$ is the output of Kruskal’s algorithm on input graph $G$.

  1. Clearly $T^*$ has no cycles.
  2. $T^*$ is connected. Why?
    1. By Empty Cut Lemma (revise [[[courses/algorithms/algorithms-greedy/week1/notes|notes]]), only need to show that $T^*$ crosses every cut.
    2. Fix a cut $(A, B)$. Since $G$ is connected, at least one of its edges crosses $(A, B)$. Key point: Kruskal will include first edge crossing $(A, B)$ that it sees (by Lonely Cut Corollary, cannot create a cycle)
  3. Every edge of $T^$ justified by the cut property (implies $T^$ is the MST) Reason: Consider iteration where edge cut $(u, v)$ added to current set $T$. Since $T \cup {(u, v)}$ has no cycle, $T$ has no $u-v$ path (as in proof of Empty Cut Lemma). Therefore, exists an empty cut $(A, B)$ separating $u$ and $v$. Hence, by 2b no edge crossing $(A, B)$ were previously considered by Kruskal’s algorithm, and $(u, v)$ is the first (hence the deepest) edge crossing $(A, B)$, $(u, v)$ justified by the Cut Property.

Implementing Kruskal’s Algorithm via Union-Find

Running time of straightforward implementation

$$ \mathcal{O}(m\log{n}) + \mathcal{O}(mn) = \mathcal{O}(mn) $$ Can we speed that up?

Plan: data structure for $\mathcal{O}(1)$ cycle check, which makes total $\mathcal{O}(m\log{n})$ time.

Union-Find data structure

Raison d’être of a union-find data structure: maintain partition of a set of objects.

Operations:

Why useful for Kruskal’s algorithm?

Union-Find basics

Idea number 1

Invariant: each vertex points to the leader vertex of its component (“name” of a component inherited from leader vertex).

Key point: given edge $(u, v)$, can check if $u$ and $v$ already in same component in $\mathcal{O}(1)$ time (if and only if leader pointers of $u$ and $v$ match i.e. $Find(u) = Find(v)$).

Hence, $\mathcal{O}(1)$ time cycle checks.

Maintaining the invariant

Note: when new edge $(u, v)$ added to $T$, connected components of $u$ and $v$ merge.

Question: how many leader pointer updates are needed to restore the invariant in the worst case? Answer: $\Theta(n)$ e.g. when merging two components with $\frac{n}{2}$ vertices each.

Idea number 2

When two components merge, have smaller one inherit the leader of the larger one (easy to maintain a size field in each component to facilitate this).

Question: how many leader pointer updates are now needed to restore the invariant in the worst case? Answer: still $\Theta(n)$, same reason with the previous [[#Maintaining the invariant]] quiz.

Updating leader pointers

But: how many times does a single vertex have its leader pointer update over the course of Kruskal’s algorithm? Answer: $\Theta(\log{n})$

Reason: every time $v$’s leader pointer gets updated, population of its component at least doubles. This can only happen $\leq \log_2{n}$ times.

Running time of fast implementation

Scorecard:

Therefore: $\mathcal{O}(m\log{n})$ total.

SOTA and Open Questions

Question

Biggest question: can we do better than $\mathcal{O}(m\log{n})$ (running time for Prim + Kruskal)?

Answer: yes.

Open questions

Weirdest of all (Pettie/Ramachandran JACM 2002): optimal deterministic MST algorithm, but precise asymptotic running time is unknown! (between $\Theta(m)$ and $\Theta(m\alpha(n))$, but don’t know where).

Yet, there are open questions:

Further reading: (Eisner 97)

MST application to Clustering

Clustering

Informal goal: given $n$ points, classify into “coherent groups”.

Assumptions:

  1. as input, given a (dis)similarity measure - a distance $d(p, q)$ between each point pair.
  2. symmetric (i.e. $d(p, q) = d(q, p)$)

Examples: Euclidean distance, genome similarity, etc.

Goal: same cluster = nearby.

Max-spacing k-clustering

Assume: we know $k$ is number of clusters desired (in practice, can experiment with a range of values).

Call points $p$ and $q$ separated if they’re assigned to different clusters.

Definition: the spacing of a $k$-clustering is

$$\min_{\text{separated } p, q} d(p, q)$$ (the bigger, the better)

Problem statement: given a distance measure $d$ and $k$, compute the $k$-clustering with maximum spacing.

A Greedy Algorithm

- initially, each point in a separate cluster
- repeat until only k clusters:
	- let p, q be the closest pair of separated points (determines the current spacing)
	- merge the clusters containing p and q into a single cluster

Note: just like Kruskal’s MST algorithm, but stopped early.

This called single-link clustering

Correctness

Claim

Theorem: single-link clustering finds the max-spacing $k$-clustering

Proof: let $C_1, C_2, .., C_k$ be the greedy clustering with spacing $S$. Let $\hat{C_1}, \hat{C_2}, .., \hat{C_k}$ be the arbitrary other clustering. We need to show the spacing of $\hat{C_1}, \hat{C_2}, .., \hat{C_k} \leq S$

[!tips]- Easy case: $p, q$ directly merged at some point if $p, q$ directly merged at some point, $S \geq d(p, q) \geq$ spacing of $\hat{C_1}, .. \hat{C_k}$

[!tips]- Tricky case: $p, q$ “indirectly merged” through multiple direct merges. Let $p, a_1, .., a_l, q$ be the path of direct greedy merges connecting $p$ and $q$. Key point: Since $p \in \hat{C_i}$ and $q \notin \hat{C_i}$, therefore exists consecutive pair $a_j, a_{j + 1}$ with $a_j \in \hat{C_i}, a_{j + 1}\notin \hat{C_i}$ such that $S \geq d(a_j, a_{j + 1}) \geq$ spacing of $\hat{C_1}, .., \hat{C_k}$

Advanced Union-Find

TODO

Quiz

Question 1

Suppose we are given a directed graph $G=(V,E)$ in which every edge has a distinct positive edge weight. A directed graph is acyclic if it has no directed cycle. Suppose that we want to compute the maximum-weight acyclic subgraph of $G$ (where the weight of a subgraph is the sum of its edges’ weights). Assume that $G$ is weakly connected, meaning that there is no cut with no edges crossing it in either direction.

Here is an analog of Prim’s algorithm for directed graphs. Start from an arbitrary vertex $s$, initialize $S={s}$ and $F=\emptyset$. While $S\neq V$, find the maximum-weight edge $(u,v)$ with one endpoint in $S$ and one endpoint in $V−S$. Add this edge to $F$, and add the appropriate endpoint to $S$.

Here is an analog of Kruskal’s algorithm. Sort the edges from highest to lowest weight. Initialize $F=\emptyset$. Scan through the edges; at each iteration, add the current edge $i$ to $F$ if and only if it does not create a directed cycle.

Which of the following is true?

Question 2

Consider a connected undirected graph $G$ with edge costs that are not necessarily distinct. Suppose we replace each edge cost $c_e$​ by $-c_e$​; call this new graph $G’$. Consider running either Kruskal’s or Prim’s minimum spanning tree algorithm on $G’$, with ties between edge costs broken arbitrarily, and possibly differently, in each algorithm. Which of the following is true?

Question 3

Consider the following algorithm that attempts to compute a minimum spanning tree of a connected undirected graph $G$ with distinct edge costs. First, sort the edges in decreasing cost order (i.e., the opposite of Kruskal’s algorithm). Initialize $T$ to be all edges of $G$. Scan through the edges (in the sorted order), and remove the current edge from $T$ if and only if it lies on a cycle of $T$.

Which of the following statements is true?

Question 4

Consider an undirected graph $G=(V,E)$ where edge $e\in E$ has cost $c_e$​. A minimum bottleneck spanning tree $T$ is a spanning tree that minimizes the maximum edge cost $\max_{e\in T}​c_e$​. Which of the following statements is true? Assume that the edge costs are distinct.

Question 5

You are given a connected undirected graph $G$ with distinct edge costs, in adjacency list representation. You are also given the edges of a minimum spanning tree $T$ of $G$. This question asks how quickly you can re-compute the MST if we change the cost of a single edge. Which of the following are true? (RECALL: It is not known how to deterministically compute an MST from scratch in $\mathcal{O}(m)$ time, where $m$ is the number of edges of $G$.)