Algorithm Design and Analysis LECTURE 8 Greedy Algorithms V Huffman Codes Adam Smith
Review Questions
Let G be a connected undirected graph with distinct edge weights. Answer true or false:
Let e be the cheapest edge in G. Some MST of G contains e? (Answer: True, by the Cut Property.)
Let e be the most expensive edge in G. No MST of G contains e? (Answer: False. Counterexample: if G is a tree, all its edges are in the MST.)
Review Exercise: given an undirected graph G, consider the spanning trees produced by four algorithms: BFS tree, DFS tree, shortest-paths tree (Dijkstra), MST. Find a graph where (a) all four trees are the same, and (b) all four trees must be different. (Note: the BFS/DFS trees may depend on the exploration order.)
Non-distinct edge weights? Read about this case in the text.
Implementing MST algorithms
Prim: similar to Dijkstra.
Kruskal: requires an efficient data structure to keep track of the islands (connected components): the Union-Find data structure. We may revisit this later in the course.
You should know how to implement Prim in O(m log_{m/n} n) time.
Implementation of Prim(G, w)
IDEA: Maintain V \ S as a priority queue Q (as in Dijkstra). Key each vertex in Q with the weight of the least-weight edge connecting it to a vertex in S.
Q ← V
key[v] ← ∞ for all v ∈ V
key[s] ← 0 for some arbitrary s ∈ V
while Q ≠ ∅
    do u ← EXTRACT-MIN(Q)
       for each v ∈ Adjacency-list[u]
           do if v ∈ Q and w(u, v) < key[v]
                  then key[v] ← w(u, v)      ▷ DECREASE-KEY
                       π[v] ← u
At the end, {(v, π[v])} forms the MST.
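As an illustration (not part of the original slides), here is a minimal Python sketch of the same idea using a binary heap. Python's heapq has no DECREASE-KEY, so the sketch pushes duplicate entries and skips stale ones after EXTRACT-MIN; the adjacency-dict graph format and the function name are assumptions made for this example.

import heapq

def prim_mst(adj, s):
    # adj: {u: [(v, weight), ...]}; returns the MST as a list of edges (pi[v], v).
    in_tree = {s}
    mst_edges = []
    heap = [(w, s, v) for v, w in adj[s]]        # entries are (key, parent, vertex)
    heapq.heapify(heap)
    while heap and len(in_tree) < len(adj):
        w, u, v = heapq.heappop(heap)            # EXTRACT-MIN
        if v in in_tree:                         # stale entry: v already left Q
            continue
        in_tree.add(v)
        mst_edges.append((u, v))                 # pi[v] = u
        for x, wx in adj[v]:
            if x not in in_tree:                 # "lazy" DECREASE-KEY: just push again
                heapq.heappush(heap, (wx, v, x))
    return mst_edges

# Tiny example graph (made up for illustration):
graph = {"a": [("b", 4), ("c", 1)],
         "b": [("a", 4), ("c", 2), ("d", 5)],
         "c": [("a", 1), ("b", 2), ("d", 8)],
         "d": [("b", 5), ("c", 8)]}
print(prim_mst(graph, "a"))                      # [('a', 'c'), ('c', 'b'), ('b', 'd')]

With a binary heap this lazy variant runs in O(m log n) time, matching the binary-heap column in the analysis that follows.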
Analysis of Prim
Building Q and initializing the keys takes Θ(n) time. The while loop runs n times (one EXTRACT-MIN per vertex); for each extracted u, the inner loop runs degree(u) times, so by the Handshaking Lemma there are Θ(m) implicit DECREASE-KEYs in total. Time: as in Dijkstra, it depends on the priority queue:

PQ Operation  | # calls in Prim | Array | Binary heap | d-way heap        | Fib heap
EXTRACT-MIN   | n               | n     | log n       | HW3               | log n
DECREASE-KEY  | m               | 1     | log n       | HW3               | 1
Total         |                 | n^2   | m log n     | m log_{m/n} n     | m + n log n

(The bounds on individual Fibonacci-heap operations are amortized.)
Greedy Algorithms for MST
Kruskal's: Start with T = ∅. Consider edges in ascending order of weight. Insert edge e into T unless doing so would create a cycle.
Reverse-Delete: Start with T = E. Consider edges in descending order of weight. Delete edge e from T unless doing so would disconnect T.
Prim's: Start with some root node s. Grow a tree T from s outward. At each step, add to T the cheapest edge e with exactly one endpoint in T.
Union-Find Data Structures

Operation \ Implementation                                  | Array + linked lists with sizes             | Balanced trees
Find (worst case)                                           | Θ(1)                                        | Θ(log n)
Union of sets A, B (worst case)                             | Θ(min(|A|, |B|)) (could be as large as Θ(n)) | Θ(log n)
Amortized: k unions and k finds, starting from singletons   | Θ(k log k)                                  | Θ(k log k)

With modifications (path compression), the amortized time for n operations on the tree structure is O(n α(n)), where α(n), the inverse Ackermann function, grows much more slowly than log n. See KT Chapter 4.6.
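A minimal Python sketch of the balanced-tree ("union by size") variant from the table, with path compression added as the modification mentioned above; the class and function names are illustrative, not from KT. Kruskal's algorithm from the earlier slide then needs only a sorted edge list plus this structure.

class UnionFind:
    # Union by size plus path compression; amortized cost per operation is
    # essentially constant (inverse-Ackermann).
    def __init__(self, elements):
        self.parent = {x: x for x in elements}
        self.size = {x: 1 for x in elements}

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:            # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False                         # already in the same island
        if self.size[ra] < self.size[rb]:        # attach the smaller tree under the larger
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True

def kruskal(vertices, weighted_edges):
    # weighted_edges: list of (weight, u, v); keep an edge iff its endpoints
    # currently lie in different islands.
    uf = UnionFind(vertices)
    return [(u, v, w) for w, u, v in sorted(weighted_edges) if uf.union(u, v)]

print(kruskal(["a", "b", "c", "d"],
              [(4, "a", "b"), (1, "a", "c"), (2, "b", "c"), (5, "b", "d"), (8, "c", "d")]))
# [('a', 'c', 1), ('b', 'c', 2), ('b', 'd', 5)]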
Huffman codes
Prefix-free codes
A binary code maps characters in an alphabet (say {A,…,Z}) to binary strings.
Prefix-free code: no codeword is a prefix of any other.
ASCII: prefix-free (all symbols have the same length).
Not prefix-free: a → 0, b → 1, c → 00, d → 01.
Why is prefix-free good?
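To see why prefix-freeness matters, here is a small illustrative Python snippet (the prefix-free code below is made up for this example). With the non-prefix-free code above, the string 001 could mean aab, cb, or ad; with a prefix-free code, greedy left-to-right decoding is unambiguous.

prefix_free = {"a": "0", "b": "10", "c": "110", "d": "111"}   # example code, not from the slide

def decode(bits, code):
    # Greedy decoding: keep reading bits until they match a codeword exactly.
    # Because no codeword is a prefix of another, the first match is the only
    # possible one, so no backtracking or delimiters are needed.
    reverse = {w: sym for sym, w in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in reverse:
            out.append(reverse[current])
            current = ""
    return "".join(out)

print(decode("0101100111", prefix_free))   # 'abcad'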
A prefix-free code for a few letters [figure: code tree, from Wikipedia]; e.g. e → 00, p → 10011. (A. Smith; based on slides by E. Demaine, C. Leiserson, S. Raskhodnikova; source: K. Wayne, Wikipedia.)
A prefix-free code [figure: a code tree over the letters A–Z, each leaf labeled with its frequency]; e.g. T → 1001, U → 1000011. Source: Jeff Erickson's notes.
How good is a prefix-free code? Given a text, let f[i] = # occurrences of letter i. Total number of bits needed to encode the text: Σ_i f[i] · depth(i), where depth(i) is the length of letter i's codeword (its depth in the code tree). How do we pick the best prefix-free code?
Huffman's Algorithm (1952)
Given individual letter frequencies f[1,…,n]:
Find the two least frequent letters i, j.
Merge them into a new symbol with frequency f[i] + f[j].
Repeat.
e.g. a: 6, b: 6, c: 4, d: 3, e: 2
Theorem: Huffman's algorithm finds an optimal prefix-free code.
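A short Python sketch of the merging loop described above, assuming the frequencies come as a dict; ties are broken arbitrarily via an insertion counter, and the function name and tree representation are illustrative only.

import heapq

def huffman_code(freq):
    # Heap entries are (frequency, tie-breaker, subtree); a subtree is either a
    # symbol or a pair (left, right).
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    if len(heap) == 1:                              # edge case: one symbol gets a 1-bit code
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)             # two least frequent subtrees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))   # merge them
        counter += 1
    code = {}
    def walk(tree, prefix):                         # 0 on left edges, 1 on right edges
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            code[tree] = prefix
    walk(heap[0][2], "")
    return code

print(huffman_code({"a": 6, "b": 6, "c": 4, "d": 3, "e": 2}))
# e.g. {'c': '00', 'e': '010', 'd': '011', 'a': '10', 'b': '11'}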
Warming up Lemma 0: Every optimal prefix-free code corresponds to a full binary tree. (Full = every node has 0 or 2 children) Lemma 1: Let x and y be two least frequent characters. There is an optimal code in which x and y are siblings.
Huffman codes are optimal
Proof by induction! Base case: two symbols; only one full tree.
Induction step: Suppose f[1], f[2] are the smallest frequencies in f[1,…,n] and T is an optimal code for {1,…,n}. Lemma 1 ⇒ we can choose T so that 1 and 2 are siblings. Introduce a new symbol numbered n+1, with f[n+1] = f[1] + f[2]. Let T′ be the code for {3,…,n+1} obtained from T by merging 1, 2 into n+1.
Cost of T in terms of T′:
cost(T) = Σ_{i=1..n} f[i] · depth(i)
        = Σ_{i=3..n+1} f[i] · depth(i) + f[1] · depth(1) + f[2] · depth(2) - f[n+1] · depth(n+1)
        = cost(T′) + f[1] · depth(1) + f[2] · depth(2) - f[n+1] · depth(n+1)
        = cost(T′) + (f[1] + f[2]) · depth(T) - f[n+1] · (depth(T) - 1)
        = cost(T′) + f[1] + f[2]
(using that 1 and 2 are siblings at depth depth(T), their parent n+1 sits at depth depth(T) - 1, and f[n+1] = f[1] + f[2]).
Let H be the Huffman code for {1,…,n} and H′ the Huffman code for {3,…,n+1}. Induction hypothesis: cost(H′) ≤ cost(T′). Then cost(H) = cost(H′) + f[1] + f[2] ≤ cost(T′) + f[1] + f[2] = cost(T). QED
Notes
See Jeff Erickson's lecture notes on greedy algorithms: http://theory.cs.uiuc.edu/~jeffe/teaching/algorithms/
Data Compression for real? Generally, we don't use letter-by-letter encoding. Instead, find frequently repeated substrings. The Lempel-Ziv algorithm is extremely common and also has deep connections to entropy. If we have time for string algorithms, we'll cover this.
Huffman codes and entropy
Given a set of frequencies, consider the probabilities p[i] = f[i] / (f[1] + … + f[n]).
Entropy(p) = Σ_i p[i] · log(1/p[i])
The Huffman code has expected depth satisfying Entropy(p) ≤ Σ_i p[i] · depth(i) ≤ Entropy(p) + 1.
To prove the upper bound, find some prefix-free code where depth(i) ≤ log(1/p[i]) + 1 for every symbol i. Exercise! The bound applies to Huffman too, by optimality.
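A quick numerical sanity check of the bound in Python, using the a/b/c/d/e frequencies from the Huffman slide and the codeword lengths of one optimal tree (both hard-coded below; they are illustrative values, not from the slides).

from math import log2

freq = {"a": 6, "b": 6, "c": 4, "d": 3, "e": 2}
depth = {"a": 2, "b": 2, "c": 2, "d": 3, "e": 3}      # depths in one optimal Huffman tree

total = sum(freq.values())
p = {s: f / total for s, f in freq.items()}

entropy = sum(p[s] * log2(1 / p[s]) for s in p)       # Entropy(p)
expected_depth = sum(p[s] * depth[s] for s in p)      # expected codeword length

# Entropy(p) <= expected depth <= Entropy(p) + 1
print(round(entropy, 3), round(expected_depth, 3))    # roughly 2.213 <= 2.238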
Prefix-free encoding of arbitrary-length strings
What if I want to send you a text, but you don't know ahead of time how long it is?
Option 1: put the length at the beginning: n + log(n) bits, but this requires me to know the length in advance.
Option 2: every B bits, put a special bit indicating whether or not we're done: n(1 + 1/B) + B - 1 bits.
Can we do better?
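A small Python sketch of option 2, purely for illustration: each block of B message bits is preceded by a continuation bit (1 = more blocks follow, 0 = last block), and the last block is padded with zeros, which is where the extra B - 1 bits in the count come from. (This simple version leaves the padding zeros attached after decoding.)

def encode_chunked(bits, B):
    # Pad to a multiple of B so every block has exactly B bits.
    padded = bits + "0" * (-len(bits) % B)
    blocks = [padded[i:i + B] for i in range(0, len(padded), B)] or ["0" * B]
    out = []
    for k, block in enumerate(blocks):
        flag = "1" if k < len(blocks) - 1 else "0"     # 1 = more blocks coming
        out.append(flag + block)
    return "".join(out)

def decode_chunked(stream, B):
    pos, blocks = 0, []
    while True:
        flag, block = stream[pos], stream[pos + 1:pos + 1 + B]
        blocks.append(block)
        pos += 1 + B
        if flag == "0":                                # last block reached
            break
    return "".join(blocks)

enc = encode_chunked("1011001", B=4)
print(enc)                         # '1' + '1011' followed by '0' + '0010'
print(decode_chunked(enc, 4))      # '10110010' (original message plus padding)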