Data Structures and Algorithms. Data Compression: Lossy vs. Lossless

15-211 Data Structures and Algorithms
Data Compression and Huffman's Algorithm
11th Feb 2003
Rajshekar Reddy

Outline
Data compression
  Lossy and lossless
  Examples
  Formal view
Codes
  Definition
  Fixed length vs. variable length
Huffman's Algorithm
  The algorithm
  Practical considerations

Data Compression
Is one of the fundamental technologies of the Internet.
Is necessary for faster data transmission.
Useful even locally to keep smaller files or back up data.

Data compression
Types of compression
  Lossless: encodes the original information exactly.
  Lossy: approximates the original information.
Uses of compression
  Images over the web: JPEG
  Music: MP3
  General-purpose: ZIP, GZIP, JAR

Lossy vs. Lossless
If lossless compression represents exactly the same data, in compressed form, why use lossy at all?
Maybe you can get excellent compression without too much loss of information?
Let's look at an example.

Compare two images
So where is the difference?
One image is 400K, the other is 100K. Which is which?

What can we conclude?
There is definitely a trade-off.
Lossless may not perform so well, but it retains 100% of the information.
Lossy can perform extremely well, but is the compression worth the loss of information?
So how do we decide which one to use?

Some Considerations
What types of files would you use a lossless algorithm on?
  211 grades.
  Complete works of Shakespeare.
What types of files would you use a lossy algorithm on?
  Images, music.
  Files where you can get away with an approximation to the data.

Another Example - SVD
Hybrid algorithm.
You compress an image to a certain rank.
So depending on the rank, you have either lossy or lossless information.
But making this algorithm lossless actually doubles the size of the file!
In what kinds of situations might it be useful?

Another Example - SVD
Suppose we send a robot to explore the moon.
It doesn't know what information is useful to us.
We can ask it to first send us a small rank and then, if we are interested, we can ask for larger ranks.
Ultimately we get all the information, but only if we really want/need it.

Another Example - SVD
Okay, so the robot on the moon seems a bit contrived.
But what about surfing the web on your handheld?
There is so much nonsense on the web, we clearly don't want to download everything since bandwidth is at a premium.

Another Example - SVD
[Figure: the same image reconstructed at increasing ranks, alongside the original.]

Question
Is there a lossless compression algorithm that can compress any file?
Answer
Absolutely not! Why not?

How does compression work?
Lossy algorithms are generally mathematically based.
They work by applying transforms. E.g. JPEG: discrete cosine transform.
By applying a transform, they attempt to approximate the original data.
Lossless algorithms cannot do that since they need to maintain the original data.
So what can they do?

How does compression work?
They need to analyze the file and take advantage of certain properties it might have. Or its structure.
For example, if you wanted to compress the first 10000 digits of Pi, what could you do?
In case they slipped your mind, here they are.

Pi 10000
[Slide listing the first 10000 digits of Pi.]

A Formal View of Compression
We need to know some distribution over character frequencies.
We can then take advantage of that to encode more frequently occurring characters with fewer bits. E.g. Huffman's Algorithm.
Alternately, we can look at character patterns rather than just characters.
We can then replace frequently occurring patterns with a small code. E.g. Lempel-Ziv-Welch or LZW (next lecture).

Interlude: Bit-level Representation of Data
All data is stored on a computer as a sequence of 0's and 1's. I.e. bits.
This is a very natural way to represent data, for the following reason:
A computer cannot, in general, infer 10 different values from the intensity of a signal.
It can however infer 2 different values very easily. I.e. whether the signal is high or low.

The problem:
If we use sequences of just 0's and 1's instead of 0-9 to represent data, regardless of the convenience, aren't we using a lot more space?
To address this issue, let's look at an example.

Suppose you had a text file (say, the complete works of Shakespeare) and you know that it has 32 different symbols and a total of 100000 characters.
How much space would be needed to represent this in base 10?
  100000 * log10(32)
How about base 2?
  100000 * log2(32)
But log2(32) = log2(10) * log10(32)
So we are only a constant factor off.
Recall asymptotic analysis: what do we do with the constants?
Remember, we are more interested in the scaling factor.

Okay, so we've established that it's easiest to store data as a sequence of 0's and 1's, but how does that help us?
In particular, how do I take a text file and store it on the computer?
To do this we need to invent a code.

Codes
We can think of data as a large sequence of bits which can be partitioned into smaller meaningful sequences.
A code then is simply a mapping from sequences of bits to characters (or something meaningful).
For example, the ASCII system is a code. It maps single bytes (8 bits) to unique characters.
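The base-10 versus base-2 comparison can be checked numerically. The following is a small sketch using the figures from the slide (32 symbols, 100000 characters); the variable names are illustrative only.

```python
import math

SYMBOLS = 32        # distinct symbols in the file (figure from the slide)
CHARS = 100_000     # total characters in the file (figure from the slide)

digits_base10 = CHARS * math.log10(SYMBOLS)  # decimal digits needed
bits_base2 = CHARS * math.log2(SYMBOLS)      # binary digits (bits) needed

# The ratio is the constant log2(10); it does not depend on the file at all.
ratio = bits_base2 / digits_base10

print(f"base 10: {digits_base10:.0f} digits")
print(f"base 2:  {bits_base2:.0f} bits")
print(f"ratio:   {ratio:.4f}  vs  log2(10) = {math.log2(10):.4f}")
```

Base 2 needs 500000 bits where base 10 needs roughly 150515 digits, so switching bases costs a factor of about 3.32 no matter how large the file is.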

Codes
You can think of a code as a function mapping characters to bit strings.
We would like it to be a bijection.
What if it is many-to-one?
What if it is one-to-many?
In the one-to-many case nothing spectacularly bad happens, but it is a pain to use the code.

Codes
A codeword is simply a binary string and a code is a collection of codewords and their meanings.
Must each codeword in a code necessarily have the same length? I.e. is every code a fixed length code?
If not, we can then construct codes.
But if all the codewords in a code are the same length, then Huffman's algorithm wouldn't compress data at all!

Prefix Free Codes
A prefix free code is one where no codeword is a prefix of another codeword.
What a great idea!
We can now construct codes whose codewords are of varying lengths. Known as variable length codes.
Let's see how they help us.

Encoding strings

Symbols                          a    b    c    d    e    Total
Frequency                        50   25   15   40   75   205 chars
Fixed-length code                000  001  010  011  100  615 bits
Variable-length code (optimal)   10   001  000  01   11   450 bits

With the fixed-length code we can encode badcae as: 001 000 011 010 000 100
Really an 18 bit string: 001000011010000100
With the variable-length code we can encode badcae as: 001 10 01 000 10 11
Really a 14 bit string: 00110010001011

Variable-length codes
Exploit statistics of symbols.
More frequently occurring symbols are encoded using fewer bits.
What makes a good variable-length code?
It should be prefix free!
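To see why prefix freeness matters in practice, here is a minimal sketch of encoding and decoding with the two codes from the "Encoding strings" table. The codewords are as read from that table; the function names (`is_prefix_free`, `encode`, `decode`) and the sample string are assumptions made for illustration.

```python
# Codewords as read from the "Encoding strings" table.
FIXED = {"a": "000", "b": "001", "c": "010", "d": "011", "e": "100"}
VARIABLE = {"a": "10", "b": "001", "c": "000", "d": "01", "e": "11"}


def is_prefix_free(code):
    """Return True if no codeword is a prefix of a different codeword."""
    words = list(code.values())
    return not any(u != v and v.startswith(u) for u in words for v in words)


def encode(text, code):
    """Concatenate the codeword of each symbol in text."""
    return "".join(code[symbol] for symbol in text)


def decode(bits, code):
    """Decode left to right; prefix freeness makes every match unambiguous."""
    by_word = {word: symbol for symbol, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in by_word:      # a complete codeword has been read
            out.append(by_word[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)


message = "badcae"                  # sample string over the alphabet a..e
packed = encode(message, VARIABLE)
print(packed, len(packed), "bits vs", len(encode(message, FIXED)), "bits")
print(decode(packed, VARIABLE))
```

Because no codeword is a prefix of another, the decoder never needs separators or lookahead: the first codeword that matches is the only one that can match.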

Tree representation
Represent prefix free codes as full binary trees.
Full: every node
  is a leaf, or
  has exactly 2 children.
The encoding is then a (unique) path from the root to a leaf.
  a=1, b=001, c=000, d=01
[Figure: the binary tree for this code, with edges labelled 0 and 1.]

Why a full binary tree?
A node with no sibling can be moved up a level, improving the code.
An optimal code for a string can always be represented by a full binary tree.

Encoding cost
Alphabet: C
Symbol: c
Symbol frequency: f(c)
Depth in tree T: d(c) (d(c) is also the number of bits to encode c)
Encoding cost: $K = \sum_{c \in C} d(c) \, f(c)$
Q: How to construct a full binary tree that minimizes K?

Huffman's Algorithm
Huffman's algorithm will give you an optimal prefix free code by constructing an appropriate tree.
Data structure used: a priority queue.
  insert(element, priority) inserts an element with the given priority into the queue.
  deleteMin() removes and returns the element with the least priority.

Huffman's Algorithm
1. Compute f(c) for every symbol c in C
2. insert(c, f(c)) into priority queue Q
3. for i = 1 to |C| - 1 (i.e. while Q holds more than one tree)
4.   z = new TreeNode()
5.   x = z.left = Q.deleteMin()
6.   y = z.right = Q.deleteMin()
7.   f(z) = f(x) + f(y)
8.   Q.insert(z, f(z))
9. return Q.deleteMin()
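The numbered steps above can be sketched in Python with the standard `heapq` module standing in for the priority queue. This is one possible rendering under stated assumptions, not the course's reference implementation: symbols are assumed to be strings, a tree is represented as either a symbol or a `(left, right)` tuple, and the helper names `huffman_code` and `cost` are made up for this example.

```python
import heapq
from itertools import count


def huffman_code(freq):
    """Build a prefix-free code from a {symbol: frequency} mapping."""
    tie = count()  # unique tie-breaker so the heap never compares trees
    # Heap entries are (frequency, tie, tree); a tree is a symbol or (left, right).
    heap = [(f, next(tie), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:               # one-symbol alphabet: use a 1-bit codeword
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        fx, _, x = heapq.heappop(heap)   # the two least frequent trees
        fy, _, y = heapq.heappop(heap)
        heapq.heappush(heap, (fx + fy, next(tie), (x, y)))
    root = heap[0][2]

    code = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: 0 goes left, 1 goes right
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            code[node] = prefix

    walk(root, "")
    return code


def cost(freq, code):
    """Encoding cost K: sum over symbols of frequency times codeword length."""
    return sum(freq[s] * len(code[s]) for s in freq)


freq = {"a": 50, "b": 25, "c": 15, "d": 40, "e": 75}
code = huffman_code(freq)
print(code, cost(freq, code))
```

On the frequencies from the "Encoding strings" table this yields a cost of 450 bits. How ties are broken can change the individual codewords, but not the total cost.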

Example
[Six example slides consisting of figures; not transcribed.]

Huffman's Algorithm
Is a greedy algorithm that constructs an optimal prefix free code for a given piece of data.
Does it really generate an optimal prefix free code?
Yes, but the proof is beyond the scope of this course!

Greedy Algorithms
At every step a greedy algorithm makes a locally optimal decision, hoping that it will add up to a global optimum.
This strategy works surprisingly well for a lot of algorithms.
Some examples:
  Huffman's for data compression.
  Kruskal's for calculating minimum spanning trees in graphs.

Hill Climbing
Suppose you wanted to reach the summit of a mountain but could only see 10 metres in any direction (due to fog).
Which way would you go?

Hill Climbing
Making the locally-best guess is efficient and easy, but doesn't always work.

Huffman's Algorithm
Why is it greedy?
Because at each iteration in the loop, it picks the two optimal trees in the priority queue with which to create a new node, without considering their implications from a global standpoint.

Hw4
Hw4 is the data compression lab.
Parts 1 and 2 are lossless compression using Huffman and LZW.
Part 3 is traditionally some lossy algorithm, but this semester it will be a competition.
Conceptually, all you need to know is in the lectures.
But this lab can get very tricky since you will be dealing with bits and bytes and some low level stuff.

Notice that Huffman's algorithm, in the setting we studied it, can only compress files of characters, since it needs to know what the alphabet is in order to count the frequencies.
Do we need to modify the algorithm in order to compress arbitrary files?
Take a minute to think about this.

No, we don't!
Suppose we have a file F to compress.
We can treat F as a stream of bits.
So we read the first byte and consider it in the context of our predefined alphabet. ASCII in this case.
Implicitly, we then end up treating every file as a text file.
Is that a good idea? What about images?

It doesn't matter!
So long as we reproduce the original bit sequence after decompression.
We can treat the file as containing just the characters {a,b,c,d} if we want; it won't affect the correctness of our algorithm.
It will, however, affect the performance. Why?
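Treating an arbitrary file as a stream of bytes can be sketched as follows: the alphabet is simply the 256 possible byte values, whatever the file "really" contains. The helper name `byte_frequencies` and the sample bytes are assumptions for illustration.

```python
from collections import Counter


def byte_frequencies(data: bytes) -> Counter:
    """Count how often each byte value (0-255) occurs in data."""
    return Counter(data)  # iterating over bytes yields integers


# Arbitrary binary content; these bytes are not meaningful text.
sample = bytes([0, 255, 0, 137, 0, 255])
freq = byte_frequencies(sample)
print(freq.most_common())
```

The resulting counts play the role of f(c) in Huffman's algorithm. For a real file, the bytes would come from reading it in binary mode, for example `open(path, "rb").read()` with a hypothetical `path`.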