COMP11212 Fundamentals of Computation Part 1: Formal Languages. Andrea Schalk

A.Schalk@manchester.ac.uk
January 22, 2014

Organizational issues

When, what, where

This course will be taught as follows.

Lectures: Will take place Wednesdays at 9.00 and Thursdays at 10.00 in Kilburn 1.1.

Examples classes: Will take place from Week 2 as follows.

  Groups   Time           Location
  M+W      Mondays 1.00   IT407
  B+X      Mondays 2.00   IT407
  Y        Mondays 3.00   IT407
  Z        Mondays 4.00   IT407

Two parts

This course consists of two distinct parts, taught by two different members of staff, Andrea Schalk and David Lester.

  Lectures   From    To      Content                Taught by
  1          29/01   29/01   Intro + start Part 1   David & Andrea
  2-10       30/01   27/02   Part 1                 Andrea
  11-20      05/03   03/04   Part 2                 David
  21-22      30/09   01/05   Revision               Andrea & David

Assessment

  Examples classes in Weeks   belong to
  2-6                         Part 1
  7-11                        Part 2

There is assessed coursework in the form of exercises you are expected to prepare before each examples class, see below for more detail. The coursework is worth 25% of the overall mark for the course unit.

The exam consists of four questions, with two questions from each part of the course. Each question in the exam is worth twenty marks. Students have to answer three out of the four questions. This means your exam mark will be out of sixty. It will be worth 75% of your mark for this course unit.

Coursework

You are expected to prepare exercises before each examples class, see the exercise sheets at the end of the notes. During the examples class these will be marked, you will get feedback on what you did, and you will also get help with the questions you could not do.

Each exercise is marked out of two, where the marks mean the following.

0: You did not make a serious attempt to solve the question.
1: You made a serious attempt but did not get close to finishing the question.
2: You did work your way through the exercise and made no serious mistakes.

There are four or five marked exercises for each examples class. Note that not all exercises are equally difficult, or require an equal amount of work. If you are seriously struggling with one exercise, move on to the next one (they don't usually build on each other). If you are struggling for time, try to do as many as you can, leaving out the more labour-intensive ones, but try to do something.

The rules for marking are as follows:

• To get a mark for any exercise you must produce all the work you have done for it before the examples class when a marker asks you to show it to them.
• The marker will ask to see all your work for an exercise, so please bring with you any rough work.
• It is possible to get one mark for an exercise where you haven't made much, or any, progress, provided your rough work gives evidence that you have seriously tried it. I'd expect there to be at least a page or more of your trying things out to find an answer.
• If you bring just a final answer for an exercise that requires steps in between you will not get the marks for it.
• If you tried an exercise and got one mark for it, and you can substantially improve what you have done after a demonstrator has explained it to you, you may ask the demonstrator who marked you originally to remark that exercise, and improve your mark from 1 to 2. This is subject to the demonstrator having time available in the examples class.
• You cannot improve your mark from 0 to 2. Hence any exercise where you can only show work you have done during the examples class will not count.
• You may have at most one remarking session per examples class (but this may include more than one exercise).

Note that the marker may ask you how you solved the various exercises before assigning you a mark for each of them. We also expect you to have all your rough work available: if you have only the solution where there is substantial work required to get there, the demonstrator has been told to give you 0.

If you cannot describe how you did the work then we have to assume that you plagiarized the solution and you will get a mark of 0 for all your work that week. If this happens twice we will put a note in your file that you have been caught plagiarizing work. Weeks in which you have done this will not count towards the seven examples class threshold (see below).

Note that your coursework marks only count if you have something recorded for at least seven out of the ten examples classes, or if there are mitigating circumstances.

Changes

Note that the course unit was taught very differently until (and including) 2011/2012. Here are the most important changes.

• There used to be three parts. The old second part (Part B in old exam papers) is no longer taught. Part 1 has been very slightly, and Part 2 (the old Part 3) substantially, expanded since then.
• In the old exams questions used to be worth ten marks each, with the exam being marked out of fifty.
• There used to be coursework on top of the exercises prepared for the examples classes. This no longer exists, but now the examples class prep is being marked.

About Part 1 of the course

As detailed on the previous page, Part 1 of this course unit is concerned with formal languages and is taught by Andrea Schalk. The following concerns organizational issues just for this first part.

Learning the material

The most important aims of this part of the course are to teach you a few concepts (such as formal language, regular expression, finite state automaton, context-free grammar) and how to solve a number of typical problems, with an emphasis on the latter.

Exercises. The best way to learn how to solve problems is to do it, which is why the notes contain a great number of exercises. Those not covered in the exercise sheets provide further opportunities to practise, for example when revising for the exams. The exercises are part of the examinable material, although I won't ask you to recall any particular exercise in the exam. Some exercises have a *, which means that they are not part of the examinable material. Note that they are not necessarily difficult, although some of them are.

Lectures. I use the lectures to introduce the main ideas, and to explain the important algorithms, as well as suggesting some ways for solving typical problems.

Self-study. The course has only three contact hours a week, and you are expected to spend a substantial amount of time each week working on the material by studying the notes yourself, connecting the written material with what was introduced in the lectures, and solving exercises.

Notes. The notes are fairly substantial because they aim to cover the material in detail. Typically, for each new idea or algorithm, there is a formal description as well as one or two examples. For most people it is worthwhile trying to understand the example before tackling the general case, although in the notes the general case is sometimes described before the example (because it can be difficult to explain the idea using just a special case). Note that there is a glossary that you can use to quickly find the meaning of the various technical terms used.

Mathematical notation. The language used to describe much of theoretical computer science is that of mathematics, because that is the only way of describing with precision what we are talking about. In the main part of the notes I have tried to remain as informal as possible in my descriptions, although all the definitions given are formal. You will not be required to repeat these definitions in the exam, you only need to understand informally what they mean. There is an appendix which spells out in more detail how the language of mathematics is used to make rigorous the concepts and operations we use. If you only understand the informal side of the course you can pass the exam (and even get a very good mark) but to master the material (and get an excellent result in the exam) you will have to get to grips with the mathematical notation as well. A maximum of 10% of the final mark depends on the material in the Appendix.

Examples classes. The examples classes give you the opportunity to get help with any problems you have, either with solving exercises or with understanding the notes. There are four examples classes associated with this part of the course, in Weeks 2 to 5. For each of these you are expected to prepare by solving the key exercises on the appropriate sheet. It also suggests exercises to do if you are working through the Appendix, and there are suggestions for additional exercises you might want to do if you find the set work easy. The sheets can be found from page 88 of the notes. We will check in each examples class whether you have done the preparation, and the data will be entered into Arcade. Solutions for each exercise sheet will be made available on the webpage for this part of the course after the last associated examples class has taken place.

Marking criteria. The marking criteria for the assessed coursework are stricter than those for the exam. In particular, I ask you to follow various algorithms as described in the notes, whereas in the exam I am happy if you can solve the problem at hand, and I don't mind how exactly you do that. In all cases it is important to show your work: if you only give an answer you may lose a substantial number of marks.

Revision. For revision purposes I suggest going over all the exercises again. Some of the exercises will probably be new to you since they are not part of the set preparation work for the examples classes. Also, you can turn all the NFAs you encounter along the way into DFAs (removing $\epsilon$-transitions as required). Lastly, there are exams from previous years to practise on. If you can do the exercises you will do well in the exam.
You will not be asked to repeat definitions, but knowing about properties of regular and context-free languages may be advantageous. You should also be aware of the few theorems, although you will not be asked to recite them.

Webpage for the course. http://studentnet.cs.manchester.ac.uk/ugt/comp11212/

Reading

For this part of the course these notes cover all the examinable material. However, sometimes it can be useful to have an additional source to see the same material introduced and explained in a slightly different way. Also, there are new examples and exercises to be looked at. For this purpose you may find one of the following useful.

M. Sipser. Introduction to the Theory of Computation. PWS Publishing Company, 1997. ISBN 0-534-94728-X.
This is a fairly mathematical book that nonetheless tries to keep mathematical notation at a minimum. It contains many illustrated examples and aims to explain ideas rather than going through proofs mechanically. A 2005 edition is also available. Around £50. This book is very well thought of and even used copies are quite expensive. Relevant to this course: Chapters 0 (what hasn't yet been covered by COMP10020), 1 and 2.1.

J.E. Hopcroft, R. Motwani, and J.D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison Wesley, second edition, 2001. ISBN 0-201-44124-1.
This account is derived from the classical textbook on the subject. It is more verbose than Sipser, and it aims to include up-to-date examples and applications, and it develops the material in much detail. A number of illustrated examples are also included. I have heard that this book is very popular among students. 2006/7 editions are also available. About £50. Relevant for this course are Chapters 1 to 5.

Get in touch

Feedback. I am always keen to get feedback regarding my teaching. Although the course has been taught a few times the notes probably still contain some errors. I'd like to hear about any errors that you may find so that I can correct them. I will make those available at http://studentnet.cs.manchester.ac.uk/ugt/comp11212/ and fix them for future years. You can talk to me after lectures, or send me email (use the address on the title page).

Reward! If you can find a substantial mistake in the lecture notes (that is, more than just a typo, or a minor language mistake) you get a chocolate bar.

Wanted. I am also on the lookout for good examples to use, either in the notes or in the lectures. These should be interesting, touch important applications of the material, or be fun. For a really good, reasonably substantial such example the reward is a chocolate bar.

Acknowledgements. I would like to thank Howard Barringer, Pete Jinks, Djihed Afifi, Francisco Lobo, Andy Ellyard, Ian Worthington, Peter Sutton, James Bedford, Matt Kelly, Cong Jiang, Mohammed Sabbar, Jonas Lorenz, Tomas Markevicius and Joe Razavi for taking the time to help me improve these notes.

Contents

Organization

1 What it is all about

2 Describing languages to a computer
  2.1 Terminology
  2.2 Defining new languages from old ones
  2.3 Describing languages through patterns
  2.4 Regular expressions
  2.5 Matching a regular expression
  2.6 The language described by a regular expression
  2.7 Regular languages
  2.8 Summary

3 How do we come up with patterns?
  3.1 Using pictures
  3.2 Following a word
  3.3 Finite state automata
  3.4 Non-deterministic automata
  3.5 Deterministic versus non-deterministic
  3.6 From automata to patterns
  3.7 From patterns to automata
  3.8 Properties of regular languages
  3.9 Equivalence of Automata
  3.10 Limitations of regular languages
  3.11 Summary

4 Describing more complicated languages
  4.1 Generating words
  4.2 Context-free grammars
  4.3 Regular languages are context-free
  4.4 Parsing and ambiguity
  4.5 A programming language
  4.6 The Backus-Naur form
  4.7 Properties of context-free languages
  4.8 Limitations of context-free languages
  4.9 Summary

Glossary

A In the language of mathematics
  A.1 The basic concepts
  A.2 Regular expressions
  A.3 Finite state automata
  A.4 Grammars

Exercise Sheet 1
Exercise Sheet 2
Exercise Sheet 3
Exercise Sheet 4
Exercise Sheet 5

Chapter 1

What it is all about

Computers need to be able to interpret input from a keyboard. They have to find commands and then carry out appropriate actions. With a command line interface this may appear a fairly simple problem, but, of course, the keyboard is not the only way of feeding a computer with instructions. Computers also have to be able to read and interpret files, for example when it comes to compiling a program from a language such as Java, C or ML, and they then have to be able to run the resulting code. In order to do this computers have to be able to split the input into strings, parse these strings, and turn those into instructions (such as Java bytecode) that the machine can carry out.

How does a computer organize potentially quite large input? How does it decide what variables to create, and what values to give these? In this part of Fundamentals of Computation we look at some techniques used for this purpose.

In order to organize (or parse) the given text-like input a computer has to be able to recognize specific strings, for example simple commands (such as if or else). The ability to find certain strings is useful in other contexts as well. When a search engine such as Google looks for a string or phrase entered by a user it certainly has to be capable of telling when it has found the string in question. However, most search engines do their job in a much more clever way than that: They will also recognize plurals of strings entered, and if such a string should be a verb of the English language it will typically recognize different forms, such as type, typing and typed. How can we come up with clever ways of searching for something more than just a given fixed string? Again this is an issue that this course covers.

How does the web server for the Intranet in the School of Computer Science check whether you are accessing it from within the University?

How do online shopping assistants work? They find a given product at various shops and tell you what it costs. How do they extract the information, in particular since some of the pages describing the product are created on demand only?
Imagine you had to write a piece of code that accepts a string as an input and checks whether this string could be a valid email address.

Imagine you had to write a program that goes through a number of emails in a folder and returns all the subject lines for those (every email client has a way of doing this, of course). How would you do that?

Imagine you had to write a program that goes through a bunch of documentation files and checks whether they contain any double typed words (such as "the the"). This is a very common mistake, and a professional organization would want to remove these before delivering the software to a customer. What if

• your program had to work across lines (and an arbitrary number of spaces)?
• you had to take into account that the first doubled word might be capitalized?
• the two doubled words might be separated by html tags?

Imagine you had to write code for a vending machine that takes coins and in return delivers variously priced goods. How does the machine keep track of how much money the user has put in so far, and how does it produce correct change from the coins it holds? It's your job to keep track of all the cases that may arise.

Imagine you had to write a piece of code that takes as input a file containing a Java program and checks whether the curly brackets { } are all properly balanced.

This course is going to help you with understanding how such tasks can be carried out. However, we won't spend much time writing code. The tools we describe here could be used when using the Unix grep command, or when programming in languages such as Perl, Python, Tcl, or even when using the GNU Emacs editor. It is more important to me that you understand the ideas, and how to express them in a reasonably formal way (so that programming them then becomes easy).

The material can be split into three parts, dealing with the following issues.

• How do we describe to a computer what strings to look for?
• How can we come up with the right description to solve our problem?
• How can we write code that checks whether some other piece of code is correctly formed, whether the opening and closing brackets match, or whether a webpage is written in valid html?
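To make the doubled-word task above a little more concrete, here is a minimal sketch in Python (one of the languages mentioned above) using the standard `re` module. The pattern and the helper name `doubled_words` are illustrative assumptions rather than anything prescribed by these notes, and the backreference `\1` is a convenience of Python's `re` module that is not among the patterns defined later on. The sketch copes with line breaks and with a capitalized first word, but not with html tags between the two words.

```python
import re

# Assumed pattern: a whole word, then whitespace (spaces or line breaks),
# then the same word again; IGNORECASE lets "The the" count as doubled.
DOUBLED = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)

def doubled_words(text):
    """Return the first word of every doubled pair found in text."""
    return [m.group(1) for m in DOUBLED.finditer(text)]
```

For instance, `doubled_words("The\n  the cat")` reports one doubled pair, while `"the then"` is not flagged because the second word continues past `the`.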

Chapter 2

Describing languages to a computer

In order to solve the kinds of problems that are mentioned in Chapter 1 we need to be able to describe to a computer what it is looking for. Only in very simple cases will this consist of just one or two strings: in general, we want the computer to look for a much larger collection of words. This could be the set of all possible IP addresses within the University, or it could be the collection of all strings of the form Subject:... (to pick out emails on a particular topic), or it could be the set of all strings of the form s s (in the simplest case of finding all doubled words; this one won't take care of doubled words spread over two lines, nor of the first word being capitalized).

2.1 Terminology

In order to talk about how we do this in practice we have to introduce a few terms so that we can talk about the various concepts involved. You can find formal definitions of these in Appendix A.

• A symbol is a building block for a string. Symbols cannot be subdivided, they are the atoms of everything we build. In the theory of formal languages they are usually called letters. Examples: a, A, 0, 1, %, @. We use letters from the end of the Roman alphabet, such as x, y, z, to refer to an arbitrary symbol.

• An alphabet is a collection of letters. We think of it as a set. At any given time we are only interested in those words we can build from a previously specified alphabet. Examples: {0, 1}, {0, 1, ..., 9}, {a, b, ..., z}, {a, b, ..., z, A, B, ..., Z}. We use capital Greek letters, typically $\Sigma$, to refer to an arbitrary alphabet.

• A string is something we build from 0 or more symbols. In the theory of formal languages we usually speak of words. Examples: ab, 0, 1001. Note that every letter can be viewed as a one-letter word. We use letters s, t, u to refer to an arbitrary word.

• The empty word (which consists of 0 letters) is a bit difficult to denote. We use $\epsilon$ (the Greek letter epsilon) for this purpose.

• Concatenation is the operation we perform on words (or letters) to obtain longer words. When concatenating a with b we get ab, and we can concatenate that with the word ba to obtain abba. If we concatenate 0 letters we get the word $\epsilon$. When we concatenate any word with the word $\epsilon$ we obtain the same word. We use the notation of powers for concatenation as follows: If s is a word then $(s)^n$ is the word we get by concatenating n copies of s. For example, $(010)^3 = 010010010$, $1^2 = 11$, $a^0 = \epsilon$ and $c^1 = c$.

• A language is a collection of words, which we think of as a set. Examples are $\{\epsilon\}$, $\emptyset$, $\{ab, abc, aba\}$ and $\{a^n \mid n \in \mathbb{N}\}$. We use letters such as $L$, $L_1$ and $L'$ to refer to an arbitrary language.

In these notes we are interested in how we can describe languages in various ways. If a language is finite then we can describe it quite easily: We just have to list all its elements. However, this method fails when the language is infinite, and even when the language in question is very large. If we want to communicate to a computer that it is to find all words of such a language we have to find a concise description.

2.2 Defining new languages from old ones

One way of describing languages is using set-theoretic notation (see Appendix A for more detail). In the main development here we try to avoid being overly mathematical, but there are some operations on languages we need to consider. There are seven of them:

• Union. Since languages are just sets we can form their unions.

• Intersection. Since languages are merely sets we can form their intersections.

• Set difference. If $L_1$ and $L_2$ are languages we can form $L_1 \setminus L_2 = \{s \in L_1 \mid s \notin L_2\}$.

• Complement. If $L$ is the language of all words over some alphabet $\Sigma$, and $L'$ is a subset of $L$, then the complement of $L'$ in $L$ is $L \setminus L'$, the set of all words over $\Sigma$ which are not contained in $L'$.

• Concatenation. Because we can concatenate words we can use the concatenation operation to define new languages from existing ones by extending this operation to apply to languages as follows. Let $L_1$ and $L_2$ be languages over some alphabet $\Sigma$. Then $L_1 \cdot L_2 = \{s \cdot t \mid s \in L_1 \text{ and } t \in L_2\}$.

• n-ary Concatenation. If we apply concatenation to the same language by forming $L \cdot L$ there is no reason to stop after just one concatenation.
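For an arbitrary language $L$ we define $L^n = \{s_1 s_2 \cdots s_n \mid s_i \in L \text{ for all } 1 \le i \le n\}$. We look at the special case $n = 0$: $L^0 = \{s_1 \cdots s_n \mid s_i \in L \text{ for all } 1 \le i \le n\} = \{\epsilon\}$, since $\epsilon$ is what we get when concatenating 0 times.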
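For finite languages the concatenation operations can be tried out directly. The following is a small sketch, assuming that words are represented as Python strings, languages as Python sets, and the empty word as the empty string `""`; the function names `concat` and `power` are made up for this illustration.

```python
def concat(l1, l2):
    """Concatenation of languages: every word of l1 followed by every word of l2."""
    return {s + t for s in l1 for t in l2}

def power(lang, n):
    """n-ary concatenation: lang concatenated with itself n times."""
    result = {""}  # concatenating 0 times yields the language containing only the empty word
    for _ in range(n):
        result = concat(result, lang)
    return result
```

With these definitions `power(lang, 0)` is `{""}` for every language `lang`, including the empty one, which mirrors the special case described above.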

• Kleene star. Sometimes we want to allow any (finite) number of concatenations of words from some language. The operation that does this is known as the Kleene star and written as $(-)^*$. The definition is:

$L^* = \{s_1 s_2 \cdots s_n \mid n \in \mathbb{N},\ s_1, s_2, \ldots, s_n \in L\} = \bigcup_{n \in \mathbb{N}} L^n$.

Note that $\emptyset^* = \bigcup_{n \in \mathbb{N}} \emptyset^n = \{\epsilon\} = L^0$ for all $L$, which may be a bit unexpected.

Exercise 1. To gain a better understanding of the operation of concatenation on languages, carry out the following tasks.
(a) Calculate {,, } {$\epsilon$,, }.
(b) Calculate {$\epsilon$,, } {, }.
(c) Calculate {, 3, 6 } { 0, 2, 3 }.
(d) Describe the language $\{0^m 1^n \mid m, n \in \mathbb{N}\}$ as the concatenation of two languages.
(e) Calculate $\{0, 01, 001\}^2$.

Exercise 2. To gain a better understanding of the Kleene star operation carry out the following tasks. (You may use any way you like to describe the infinite languages involved: Just using plain English is the easiest option. I would prefer not to see ... in the description.)
(a) Calculate {, }, {} {}, {} {}, {} \ {} and the complement of {} in the language of all words over the alphabet {a, b}.
(b) Calculate $\{0^{2n} \mid n \in \mathbb{N}\}^*$.
(c) Describe the set of all words built from a given alphabet $\Sigma$ using the Kleene star operation.

2.3 Describing languages through patterns

By definition a language is a set. While sets are something mathematicians can deal with very well, and something that computer scientists have to be able to cope with, computers aren't at all well suited to make sense of expressions such as $\{(01)^n \mid n \in \mathbb{N}\}$.

There are better ways of showing a computer what all the words are that we mean in a particular case. We have already seen that we can express this language in a different way, namely as $\{01\}^*$. If there is only one word in a language, here $\{01\}$, we might as well leave out the curly brackets and write [1] $(01)^*$. That is actually something a machine can cope with, although we would have to use a slightly different alphabet, say (01)^*. All a computer now has to do is to compare: Does the first character [2] of my string equal 0? Does the next one [3] equal 1? And so on.

[1] Can you see why we need brackets here?
[2] What should the computer do if the string is empty?
[3] What if there isn't a next symbol?

What we have done here is to create a pattern. It consists of various characters of the alphabet, concatenated. We are allowed to apply the Kleene star to any part of the string so created, and we use brackets ( ) to indicate which part should be affected by the star. A computer can then match this pattern.

Are these all the patterns we need? Not quite. How, for example, would we describe the language

$\{0^n \mid n \in \mathbb{N}\} \cup \{1^n \mid n \in \mathbb{N}\} = \{x^n \mid x = 0 \text{ or } x = 1\}$?

We can't use either of $0^*1^*$ or (arguably worse) $(01)^*$ because both of these include words that contain 0 as well as 1, whereas any word in our target language consists entirely of 0s or entirely of 1s. We need to have a way of saying that either of two possibilities might hold. For this, we use the symbol $|$. Then we can use $0^* | 1^*$ to describe the language above.

Exercise 3. Which of the following words match the given patterns?
[Table of patterns and candidate words; its entries are not recoverable here.]

Exercise 4. Describe all the words matching the following patterns. For finite languages just list the elements, for infinite ones, try to describe the words in question using English. If you want to practise using set-theoretic notation, add a description in that format.
(a) (0 1 2) (b) (0 1)(0 2)2 (c) (01 10) (d) 0 1, (e) (01) 1, (f) 0 1, (g) (010), (h) (01) 0, (i) (01) (01). (j) (0 1) (k) 0 1 2 (l) 0 1 2
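The character-by-character comparison described in Section 2.3 for the pattern $(01)^*$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the second function relies on Python's `re` module, whose spelling `0*|1*` corresponds to the pattern $0^* | 1^*$ above.

```python
import re

def matches_01_star(s):
    """Check by hand whether s consists of zero or more copies of 01."""
    # The empty string is zero copies of 01, so it matches.
    # A string of odd length runs out of symbols part-way through a copy.
    if len(s) % 2 != 0:
        return False
    for i in range(0, len(s), 2):
        if s[i] != "0" or s[i + 1] != "1":
            return False
    return True

def all_zeros_or_all_ones(s):
    """Python spelling of the pattern 0*|1*; fullmatch requires the whole string to match."""
    return re.fullmatch(r"0*|1*", s) is not None
```

One design choice worth noting is the use of `re.fullmatch` rather than `re.search`: matching in the sense of these notes always concerns the entire word, not some part of it.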

2.4 Regular expressions

So far we have been using the idea of a pattern intuitively: we have not said how exactly we can form patterns, nor have we properly defined when a word matches a pattern. It is time to become rigorous about these issues.

For reasons of completeness we will need two patterns which seem a bit weird, namely $\epsilon$ (the pattern which is matched precisely by the empty word $\epsilon$) and $\emptyset$ (the pattern which is matched by no word at all).

Definition 1. Let $\Sigma$ be an alphabet. A pattern or regular expression over $\Sigma$ is any word over

$\Sigma_{pat} = \Sigma \cup \{\emptyset, \epsilon, |, *, (, )\}$

generated by the following inductive definition.

Empty pattern: the character $\emptyset$ is a pattern;
Empty word: the character $\epsilon$ is a pattern;
Letters: every letter from $\Sigma$ is a pattern;
Concatenation: if $p_1$ and $p_2$ are patterns then so is $(p_1 p_2)$;
Alternative: if $p_1$ and $p_2$ are patterns then so is $(p_1 | p_2)$;
Kleene star: if $p$ is a pattern then so is $(p^*)$.

In other words we have defined a language [4], namely the language of all regular expressions, or patterns. Note that while we are interested in words over the alphabet $\Sigma$ we need additional symbols to create our patterns. That is why we have to extend the alphabet to $\Sigma_{pat}$.

In practice we will often leave out some of the brackets that appear in the formal definition, but only those brackets that can be uniquely reconstructed. Otherwise we would have to write $((0|1)^*0)$ instead of the simpler $(0|1)^*0$. In order to be able to do that we have to define how to put the brackets back into such an expression. We first put brackets around any occurrence of $*$ with the sub-pattern immediately to its left, then around any occurrence of concatenation, and lastly around the alternative operator. Note that every regular expression with all its brackets has precisely one way of building it from the rules: we say that patterns are uniquely parsed. This isn't quite true once we have removed the brackets: $(0|(1|2))$ turns into $0|1|2$ as does $((0|1)|2)$. However, given the way we use regular expressions this does not cause any problems.

Note that many computer languages that use regular expressions have additional operators for these (see Exercise 7). However, these exist only for the convenience of the programmer and don't actually make these regular expressions more powerful. We say that they have the same power of expressivity. Whenever I ask you to create a regular expression or pattern it is Definition 1 I expect you to follow.

[4] In Chapter 4 we look at how to describe a language like that; it cannot be done using a pattern.

2.5 Matching a regular expression

This language is recursively defined: There are base cases consisting of $\emptyset$, $\epsilon$ and all the letters of the underlying alphabet, and three ways of constructing new patterns from old ones. If we have to define something for all patterns we may now do so by doing the following:

• Say how the definition works for each of the base cases;
• assuming we have defined the concept for $p$, $p_1$ and $p_2$, say how to obtain a definition for $p_1 p_2$, $p_1 | p_2$ and $p^*$.

This is an example of a recursive definition, and you will see more of these in COMP10020. [5]

Note that in the definition of a pattern no meaning is attached to any of the operators, or even the symbols that appear in the base cases. We only find out what the intended meaning of these is when we define when a word matches a pattern. We do this by giving a recursive definition as outlined above.

Definition 2. Let $p$ be a pattern over an alphabet $\Sigma$ and let $s$ be a word over $\Sigma$. We say that $s$ matches $p$ if one of the following cases holds:

Empty word: the empty word $\epsilon$ matches the pattern $\epsilon$;
Base case: the pattern is $p = x$ for a character $x$ from $\Sigma$ and $s = x$;
Concatenation: the pattern $p$ is a concatenation $p = (p_1 p_2)$ and there are words $s_1$ and $s_2$ such that $s_1$ matches $p_1$, $s_2$ matches $p_2$ and $s$ is the concatenation of $s_1$ and $s_2$;
Alternative: the pattern $p$ is an alternative $p = (p_1 | p_2)$ and $s$ matches $p_1$ or $p_2$ (it is allowed to match both);
Kleene star: the pattern $p$ is of the form $p = (q^*)$ and $s$ can be written as a finite concatenation $s = s_1 s_2 \cdots s_n$ such that $s_1, s_2, \ldots, s_n$ all match $q$; this includes the case where $s$ is empty (and thus an empty concatenation, with $n = 0$).

Note that there is no word matching the pattern $\emptyset$.

[5] Students who are on the joint honours CS and Maths programme don't take COMP10020, but they should be able to grasp these ideas without problems.

Exercise 5. Calculate all the words matching the patterns $\epsilon$ and $a$ (for $a \in \Sigma$) respectively.

Exercise 6. For the given pattern, and the given word, employ the recursive Definition 2 to demonstrate that the word does indeed match the pattern:
(a) the pattern () and the word.
(b) the pattern (0 1) 10 and the word 10010,
(c) the pattern (0 1 )10 and the word 0010,
(d) the pattern (c) and the word.

Exercise 7. Here are some examples of the usage of regular expressions in the real world.
(a) Print out the manual page for the Unix command grep (type man grep to get that page). Now argue that for every regular expression understood by grep there is a regular expression according to Definition 1 that has precisely the same words matching it. [6]
(b) Give a command-line instruction which will show all the subjects and authors of mails contained in some directory. Make sure to test your suggested answer on a computer running Linux. Hint: Try using egrep.
(c) Give a regular expression that matches all IP addresses owned by the University of Manchester. You probably want to use some kind of shortcut notation.

[6] Theoreticians want their patterns with as few cases as possible so as to have fewer cases for proofs by induction. Practitioners want lots of pre-defined shortcuts for ease of use. This exercise shows that it doesn't really matter which version you use.

2.6 The language described by a regular expression

Given a pattern we can now define a language based on it.

Definition 3. Let $p$ be a regular expression over an alphabet $\Sigma$. The language defined by pattern $p$, $L(p)$, is the set of all words over $\Sigma$ that match $p$. In other words

$L(p) = \{s \in \Sigma^* \mid s \text{ matches } p\}$.

Note that different patterns may define the same language, for example $L(0^*) = L(\epsilon | 00^*)$.

Note that we can also define the language given by a pattern in a different way, namely recursively.

• $L(\emptyset) = \emptyset$;
• $L(\epsilon) = \{\epsilon\}$;
• $L(x) = \{x\}$ for $x \in \Sigma$;

and for the operations

• $L(p_1 p_2) = L(p_1) \cdot L(p_2)$;
• $L(p_1 | p_2) = L(p_1) \cup L(p_2)$;
• $L(p^*) = (L(p))^*$.

We can use this description to calculate the language defined by a pattern as in the following example.

$L((0|1)^*) = (L(0|1))^* = (L(0) \cup L(1))^* = (\{0\} \cup \{1\})^* = \{0, 1\}^*$

This is the language of all words over the alphabet $\{0, 1\}$.
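Definition 2 translates almost clause by clause into a recursive function. The sketch below is one possible rendering in Python, not an efficient matcher: the encoding of patterns as nested tuples and all the names (`sym`, `cat`, `alt`, `star`, `matches`, `language_up_to`) are assumptions made for this illustration. The helper `language_up_to` lists the words of $L(p)$ up to a given length by brute force, which is enough to compare two patterns on short words.

```python
from itertools import product

# Assumed encoding: one tagged tuple per clause of Definition 1.
EMPTY = ("empty",)   # the pattern matched by no word
EPS = ("eps",)       # the pattern matched by the empty word only

def sym(x): return ("sym", x)
def cat(p1, p2): return ("cat", p1, p2)
def alt(p1, p2): return ("alt", p1, p2)
def star(p): return ("star", p)

def matches(p, s):
    """Does the word s match the pattern p, following Definition 2?"""
    kind = p[0]
    if kind == "empty":
        return False
    if kind == "eps":
        return s == ""
    if kind == "sym":
        return s == p[1]
    if kind == "cat":
        # try every way of splitting s into s1 followed by s2
        return any(matches(p[1], s[:i]) and matches(p[2], s[i:])
                   for i in range(len(s) + 1))
    if kind == "alt":
        return matches(p[1], s) or matches(p[2], s)
    if kind == "star":
        if s == "":
            return True  # the empty concatenation, n = 0
        # split off a non-empty first piece so that the recursion terminates
        return any(matches(p[1], s[:i]) and matches(p, s[i:])
                   for i in range(1, len(s) + 1))
    raise ValueError("unknown pattern: %r" % (p,))

def language_up_to(p, alphabet, max_len):
    """All words over alphabet of length at most max_len that match p."""
    words = ("".join(w) for n in range(max_len + 1)
             for w in product(alphabet, repeat=n))
    return {w for w in words if matches(p, w)}

zero, one = sym("0"), sym("1")
```

As a usage example, `language_up_to(star(zero), "01", 4)` and `language_up_to(alt(EPS, cat(zero, star(zero))), "01", 4)` produce the same set, in line with the remark that $L(0^*) = L(\epsilon | 00^*)$. Such a bounded check can only ever suggest that two patterns agree; it does not prove it.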

Exercise 8. Use this recursive definition to find the languages defined by the following patterns, that is the languages $L(p)$ for the following $p$.
(a) (0 1 ) (b) (01) 0 (c) (00) (d) ((0 1)(0 1)).
Can you describe these languages in English?

In order to tell a computer that we want it to look for words belonging to a particular language we have to find a pattern that describes it. The following exercise lets you practise this skill.

Exercise 9. Find [7] a regular expression $p$ over the alphabet $\{0, 1\}$ such that the language defined by $p$ is the one given. Hint: For some of the exercises it may help not to think of a pattern that is somehow formed like the strings you want to capture, but to realize that as long as there's one way of matching the pattern for such a string, that's good enough.
(a) All words that begin with 0 and end with 1.
(b) All words that contain at least two 0s.
(c) All words that contain at least one 0 and at least one 1.
(d) All words that have length at least 2 and whose last but one symbol is 0.
(e) All words which contain the string 11, that is, two consecutive 1s.
(f) All words whose length is at least 3.
(g) All words whose length is at most 4.
(h) All words that start with 0 and have odd length.
(i) All words for which every letter at an even position is 0.
(j) All words that aren't equal to the empty word.
(k) All words that contain at least two 0s and at most one 1.
(l) All words that aren't equal to 11 or 111.
(m) All words that contain an even number of 0s.
(n) All words whose number of 0s is divisible by 3.
(o) All words that do not contain the string 11.
(p) All words that do not contain the string 10.
(q) All words that do not contain the string 101.
(r)* All words that contain an even number of 0s and whose number of 1s is divisible by 3.

[7] If you find the last few of these really hard then skip them for now. The tools of the next chapter should help you finish them.

Exercise 10. Find a regular expression p over the alphabet {a, b, c} such that the language defined by p is the one given.
(a) All the words that don't contain the letter c.
(b) All the words where every [...] is immediately followed by [...].
(c) All the words that do not contain the string [...].
(d) All the words that do not contain the string [...].

2.7 Regular languages

We have therefore defined a certain category of languages, namely those that can be defined using a regular expression. It turns out that these languages are quite important so we give them a name.

Definition 4. A language L is regular if it is the set of all words matching some regular expression, that is, if there is a pattern p such that L = L(p).

Regular languages are not rare, and there are ways of building them. Nonetheless in Chapter 4 we see that not all languages of interest are regular.

Proposition 2.1. Assume that L, L₁, and L₂ are regular languages over an alphabet Σ. Then the following also are regular languages.
(a) L₁ ∪ L₂
(b) L₁ · L₂
(c) Lⁿ
(d) L*

Regular expressions are very handy when it comes to communicating to a computer quite complicated collections of words that we might be looking for. However, coming up with a regular expression that describes precisely those words we have in mind isn't always easy.

In the next chapter we look at different ways of describing languages, ways which make it easier for human beings to ensure that they are describing the language they had in mind. There is then an algorithm that translates one of these descriptions into a pattern.

2.8 Summary

- In order to describe a language to a computer we can use regular expressions, also known as patterns.
- For a regular expression we have the notion of it being matched by some particular word.
- Each regular expression defines a language, namely the set of all words that match it.
- We say that a language is regular if there is a pattern that defines it.

Chapter 3 How do we come up with patterns?

As you should have noticed by now (if you have done the last two exercises from the previous section) coming up with a pattern that describes a particular language can be quite difficult. One has to develop an intuition about how to think of the words in the desired language, and turn that into a characteristic for which a pattern can be created. While regular expressions give us a format that computers understand well, human beings do much better with other ways of addressing these issues.

3.1 Using pictures

Imagine somebody asked you to check a word of an unknown length to see whether it contains an even number of 0s. The word is produced one letter at a time. How would you go about that?

In all likelihood you would not bother to count the number of 0s, but instead you would want to remember whether you have so far seen an odd, or an even, number of 0s. Like flicking a switch, when you see the first 0, you'd remember 'odd', when you see the second, you'd switch back (since that's the state you'd start with) to 'even', and so forth. If one wanted to produce a description of this idea it might look like this:

[Figure: two states, Even and Odd, with an edge labelled 0 from each to the other.]

Every time we see 0 we switch from the 'even' state to the 'odd' state and vice versa. If we wanted to give this as a description to somebody else then maybe we should also say what we are doing when we see a letter other than 0, namely stay in whatever state we are in. Let's assume we are talking about words consisting of 0s and 1s. Also, we'd like to use circles for our states because they look nicer, so we'll abbreviate their names.

[Figure: states E and O drawn as circles, with an edge labelled 0 from each to the other and a loop labelled 1 on each.]
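The idea of remembering only whether the number of 0s seen so far is even or odd, rather than counting them, can be sketched in Python as follows. The function name and the use of the strings "Even" and "Odd" for the two states are choices made for this illustration, and the sketch assumes that the word is given as a string over {0, 1}.

```python
def has_even_number_of_zeros(word):
    """Read the word one letter at a time, remembering only the parity of 0s."""
    state = "Even"                 # before any letter, zero 0s have been seen
    for letter in word:
        if letter == "0":
            # seeing a 0 flicks the switch
            state = "Odd" if state == "Even" else "Even"
        # any other letter (here: 1) leaves us in the state we are in
    return state == "Even"
```

Note that the function never stores more than one of two values, no matter how long the word is, which is exactly what the picture with two states expresses.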

So now somebody else using our picture would know what to do if the next letter is 0, and what to do if it is 1. But how would somebody else know where to begin?

[Figure: the same automaton with a little arrow pointing at state E.]

We give them a little arrow that points at the state one should start in. However, they would still only know whether they finished in the state called E or the one called O, which wouldn't tell them whether this was the desired outcome or not. Hence we mark the state we want to be in when the word comes to an end, and now we do have a complete description of our task.

[Figure: the same automaton with state E drawn as a double circle.]

We use a double circle for a state to indicate that if we end up there then the word we looked at satisfied our criteria.

Let's try this idea on a different problem. Let's assume we are interested in whether our word has 0 in every position that is a multiple of three, that is, the third, sixth, ninth, etc, letters, if they exist, are 0.

So we start in some state, and we don't care whether the first letter is 0 or 1, but we must remember that we've seen one letter so that we can tell when we have reached the third letter, so we'd draw something like this:

[Figure: start state 0 with an edge labelled 0, 1 to state 1.]

Similarly for the second letter.

[Figure: states 0, 1 and 2, with an edge labelled 0, 1 from 0 to 1 and from 1 to 2.]

Now the third. This time something happens: If we see 0, we're still okay, but if we see 1 then we need to reject the word.

[Figure: as before, with an edge labelled 0 from state 2 to state 3 and an edge labelled 1 from state 2 to state 4.]

So what now? Well, if we reach the state labelled 4 then no matter what we see next, we will never accept the word in question as satisfying our condition. So we simply think of this state as one that means we won't accept a word that ends there. All the other states are fine: if we stop in any of them when reading a word it does satisfy our condition. So we mark all the other states as good ones to end in, and add the transitions that keep us in state 4.

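[Figure: the automaton so far, with states 0 to 3 marked as accepting and a loop labelled 0, 1 on state 4.]

But what if the word is okay until state 3? Then we have to start all over again, not caring about the next letter or the one after, but requiring the third one to be 0. In other words, we're in the same position as at the start, so the easiest thing to do is not to create state 3, but instead to have that edge go back to state 0.

[Figure: accepting states 0, 1 and 2, with edges labelled 0, 1 from 0 to 1 and from 1 to 2, an edge labelled 0 from 2 back to 0, an edge labelled 1 from 2 to state 4, and a loop labelled 0, 1 on state 4.]

Exercise 11. For the languages described in parts ([...]) and (c) of Exercise 9, draw a picture as in the examples just given.

3.2 Following a word

Consider the following picture.

[Figure: automaton diagram.]

What happens when we try to follow a word, say [...]? To begin with we are in the start state. We see [...], so we follow the edge labelled [...].

Having followed the first letter, we can drop it.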

Now we see the letter [...], so we go once round the loop in our current state.

Now we have [...] so we follow the edge labelled [...].

Lastly we have the word [...], so we follow the edge to the right, labelled [...].

We end up in a non-accepting state so this wasn't one of the words we were looking for.

Exercise 12. Follow these words through the above automaton. If you can, try to describe the words which end up in the one desirable state (the double-ringed one).
(a) [...] (b) [...] (c) [...].

3.3 Finite state automata

So far we have been very informal with our pictures. It is impossible to define an automaton without becoming fairly mathematical. For any of these pictures we must know which alphabet Σ we are considering. Every automaton consists of

- a number of states,
- one state that is initial,
- a number (possibly 0) of states that are accepting,
- for every state and every letter x from the alphabet Σ precisely one transition from that state to another, labelled with x.

Formally we may consider the set of all the states in the automaton, say Q. One of these states is the one we want to start in, called the start state or the initial state. Some of the states are the ones that tell us if we

end up there we have found the kind of word we were looking for. We call these accepting states. They form a subset, say F, of Q.

The edges in the graph are a nice way of visualizing the transitions. Formally what we need is a function that takes as its input a state and a letter from Σ and returns a state. We call this the transition function, δ. It takes as inputs a state and a letter, so the input is a pair (q, x), where q ∈ Q and x ∈ Σ. That means that the input comes from the set

Q × Σ = {(q, x) | q ∈ Q, x ∈ Σ}.

Its output is a state, that is an element of Q. So we have that

δ : Q × Σ → Q.

The formal definition then is as follows.

Definition 5. Let Σ be a finite alphabet of symbols. A (deterministic) finite¹ automaton, short DFA, over Σ consists of the following:

- a finite non-empty set Q of states;
- a particular element of Q called the start state (which we often denote with q•);
- a subset F of Q consisting of the accepting states;
- a transition function δ which for every state q ∈ Q and every symbol x ∈ Σ returns the next state δ(q, x) ∈ Q, so δ is a function from Q × Σ to Q. When δ(q, x) = q′ we often write q --x--> q′.

We sometimes put these four items together in a quadruple and speak of the deterministic finite automaton (Q, q•, F, δ). Sometimes people also refer to a finite state machine.

Note that for every particular word there is precisely one path through the automaton: We start in the start state, and then read off the letters one by one. The transition function makes sure that we will have precisely one edge to follow for each letter. When we have followed the last letter of the word we can read off whether we want to accept it (if we are in an accepting state) or not (otherwise). That's why these automata are called deterministic; we see non-deterministic automata below.

For every word x₁x₂···xₙ we have a uniquely determined sequence of states q•, q₁, ..., qₙ such that

(q• = q₀) --x₁--> q₁ --x₂--> q₂ ··· --xₙ--> qₙ.

We accept the word if and only if the last state reached, qₙ, is an accepting state. Formally it is easiest to define this condition as in Definition 6 below.
¹ All our automata have a finite number of states and we often drop the word 'finite' when referring to them in these notes.
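Definition 5 translates fairly directly into a program. The following Python sketch makes a number of assumptions for the sake of illustration: the transition function δ is stored as a dictionary keyed by (state, letter) pairs, the function names run and accepts are hypothetical, and the example automaton is our reading of the final picture of Section 3.1 (a 0 in every position that is a multiple of three), with accepting states 0, 1 and 2 and state 4 as the state that cannot be left.

```python
def run(dfa, word):
    """Return the uniquely determined sequence of states for the word."""
    states, start, accepting, delta = dfa
    path = [start]
    for letter in word:
        # delta is a function, so there is exactly one next state
        path.append(delta[(path[-1], letter)])
    return path


def accepts(dfa, word):
    """A word is accepted if and only if the last state reached is accepting."""
    states, start, accepting, delta = dfa
    return run(dfa, word)[-1] in accepting


# The quadruple (Q, start state, F, delta) for the example automaton.
multiple_of_three = (
    {0, 1, 2, 4},                  # Q
    0,                             # start state
    {0, 1, 2},                     # F, the accepting states
    {                              # delta
        (0, "0"): 1, (0, "1"): 1,  # first letter: we do not care
        (1, "0"): 2, (1, "1"): 2,  # second letter: we do not care
        (2, "0"): 0, (2, "1"): 4,  # third letter must be 0
        (4, "0"): 4, (4, "1"): 4,  # no way out of state 4
    },
)
```

For instance, run(multiple_of_three, "110") follows the states 0, 1, 2, 0, so the word 110 is accepted, whereas 111 ends in state 4 and is not. Calling accepts with the empty word returns True precisely because the start state is accepting.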

Definition 6. A word s = x₁···xₙ over Σ is accepted by the deterministic finite automaton (Q, q•, F, δ) if

- δ(q•, x₁) = q₁,
- δ(q₁, x₂) = q₂,
- ...,
- δ(qₙ₋₁, xₙ) = qₙ and
- qₙ ∈ F, that is, qₙ is an accepting state.

In particular, the empty word is accepted if and only if the start state is an accepting state.

Just as was the case for regular expressions we can view a DFA as defining a language.

Definition 7. We say that a language is recognized by a finite automaton if it is the set of all words accepted by the automaton.

Exercise 13. Design DFAs that accept precisely those words given by the languages described in the various parts of Exercise 9 (you should already have taken care of ([...]) and (c)). Note that (r) is no longer marked with * here.

Exercise 14. Do the same for the languages described in Exercise 10.

3.4 Non-deterministic automata

Sometimes it can be easier to draw an automaton that is non-deterministic. What this means is that from some state, there may be several edges labelled with the same letter. As a result, there is then no longer a unique path when we follow a particular word through the automaton. Hence the procedure of following a word through an automaton becomes more complicated: we have to consider a number of possible paths we might take.

When trying to draw an automaton recognizing the language from Exercise 9 (a) you may have been tempted to draw such an automaton. The language in question is that of all words over {0, 1} that begin with 0 and end with 1. We know that any word we accept has to start with 0, and if the first letter we see is 1 then we never accept the word, so we begin by drawing something like this.

[Figure: start state 0 with an edge labelled 0 to state 1 and an edge labelled 1 to state 2; state 2 has a loop labelled 0, 1.]

Now we don't care what happens until we reach the last symbol, and when that is 1 we want to accept the word. (If it wasn't the last letter then we shouldn't accept the word.) The following would do that job:

[Figure: as before, with a loop labelled 0, 1 on state 1, an edge labelled 1 from state 1 to the accepting state 3, and an edge labelled 0, 1 from state 3 to state 2.]

But now when we are in state 1 and see 1 there are two edges we might follow: The loop that leads again to state 1 or the edge that leads to state 3. So now when we follow the word 011 through the automaton there are two possible paths:

- From state 0, read 0, go to state 1. Read 1 and go to state 3. Read 1 and go to state 2. Finish not accepting the word.
- From state 0, read 0, go to state 1. Read 1 and go to state 1. Read 1 and go to state 3. Finish accepting the word.

We say that the automaton accepts the word if there is at least one such path that ends in an accepting state.

So how does the definition of a non-deterministic automaton differ from that of a deterministic one? We still have

- a set of states Q,
- a particular start state q• in Q, and
- a set of accepting states F ⊆ Q.

However, it no longer is the case that for every state and every letter x from Σ there is precisely one edge labelled with x; there may be several. What we no longer have is a transition function. Instead we have a transition relation.² Given a state q, a letter x and another state q′ the relation δ tells us whether or not there is an edge labelled x from q to q′.

Exercise 15. Go back and check your solutions to Exercises 13 and 14. Were they all deterministic as required? If not, redo them.

We can turn this idea into a formal definition.

Definition 8. A non-deterministic finite automaton, short NFA, is given by

- a finite non-empty set Q of states,
- a start state q• in Q,
- a subset F of Q of accepting states

as well as

- a transition relation δ which relates a pair consisting of a state and a letter to a state. We often write q --x--> q′ if (q, x) is δ-related to q′.

We can now also say when an NFA accepts a word.

² You will meet relations also in COMP10020.

Definition 9. A word s = x₁···xₙ over Σ is accepted by the non-deterministic finite automaton (Q, q•, F, δ) if there are states

q₀ = q•, q₁, ..., qₙ

such that for all 0 ≤ i < n, δ relates (qᵢ, xᵢ₊₁) to qᵢ₊₁, and such that qₙ ∈ F, that is, qₙ is an accepting state. The language recognized by an NFA is the set of all words it accepts.

An NFA therefore accepts a word x₁x₂···xₙ if there are states q• = q₀, q₁, ..., qₙ such that

(q• = q₀) --x₁--> q₁ --x₂--> q₂ ··· --xₙ--> qₙ,

and qₙ is an accepting state. However, the sequence of states is no longer uniquely determined, and there could potentially be many.

Exercise 16. Consider the following NFA.

[Figure: NFA diagram.]

Which of the following words are accepted by the automaton? ε, [...]. Can you describe the language consisting of all the words accepted by this automaton?

Note that in the definition of an NFA there is no rule that says that for a given state there must be an edge for every label. That means that when following a word through a non-deterministic automaton we might find ourselves stuck in a state because there is no edge out of it for the letter we currently see. If that happens then we know that along our current route, the word will not be accepted by the automaton. While this may be confusing at first sight, it is actually quite convenient. It means that pictures of non-deterministic automata can be quite small.

Take the above example of finding an NFA that accepts all the words that start with 0 and end with 1. We can give a much smaller automaton that does the same job.

[Figure: three states in a row, with an edge labelled 0 from the start state to the middle state, a loop labelled 0, 1 on the middle state, and an edge labelled 1 from the middle state to the accepting state.]

You should spend a moment convincing yourself that this automaton does indeed accept precisely the words claimed (that is, the same ones as the previous automaton).³

This is such a useful convention that it is usually also adopted when drawing deterministic automata. Consider the problem of designing a DFA that recognizes those words over the alphabet {0, 1} of length at least 2 for which the first letter is 0 and the second letter is 1.
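As an aside on Definition 9, acceptance by an NFA can also be checked mechanically. The following Python sketch is one possible way of doing so, under assumptions made for illustration only: the transition relation δ is stored as a set of triples (q, x, q′), the state names "s", "m" and "f" for the small automaton that accepts the words starting with 0 and ending with 1 are hypothetical (the picture does not name its states), and instead of listing every path separately the function keeps track of the set of all states that some path could currently be in.

```python
def nfa_accepts(nfa, word):
    """Decide whether at least one path for the word ends in an accepting state."""
    states, start, accepting, delta = nfa
    current = {start}              # all states some path could be in so far
    for letter in word:
        # follow every edge labelled with this letter out of a current state;
        # paths that are stuck simply contribute nothing
        current = {q2 for (q1, x, q2) in delta
                   if q1 in current and x == letter}
    return bool(current & accepting)


# Small NFA for "starts with 0 and ends with 1" (state names are hypothetical).
starts_0_ends_1 = (
    {"s", "m", "f"},               # Q
    "s",                           # start state
    {"f"},                         # F
    {                              # delta as a set of (state, letter, state)
        ("s", "0", "m"),
        ("m", "0", "m"),
        ("m", "1", "m"),
        ("m", "1", "f"),
    },
)
```

A word is accepted exactly when the final set contains an accepting state, which is the same as saying that there is at least one sequence of states as required by Definition 9. A word beginning with 1 gets stuck immediately: the set of current states becomes empty and stays empty.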
By concentrating on what it takes to get a word to an accepting state one might well draw something like this:

³ Note that unless we have to describe the automaton in another way, or otherwise have reasons to be able to refer to a particular state, there is no reason for giving the states names in the picture.

[Figure: states 0, 1 and 2 in a row, with an edge labelled 0 from state 0 to state 1, an edge labelled 1 from state 1 to state 2, and a loop labelled 0, 1 on the accepting state 2.]

This is a perfectly good picture of a deterministic finite automaton. However, not all the states, and not all the transitions, are drawn for this automaton: Above we said that for every state, and every letter from the alphabet, there must be a transition from that state labelled with that letter. Here, however, there is no transition labelled 1 from the state 0, and no transition labelled 0 from the state 1.

What does the automaton do if it sees 1 in state 0, or 0 in state 1? Well, it discards the word as non-acceptable, in a manner of speaking. We can complete the above picture to show all required states by assuming there's a hidden state that we may think of as a 'dump'. As soon as we have determined that a particular word can't be accepted we send it off to that dump state (which is certainly not an accepting state), and there's no way out of that state. So all the transitions not shown in the picture above go to that hidden state. With the hidden state drawn our automaton looks like this:

[Figure: the same automaton with an additional non-accepting state 3, an edge labelled 1 from state 0 to state 3, an edge labelled 0 from state 1 to state 3, and a loop labelled 0, 1 on state 3.]

This picture is quite a bit more complicated than the previous one, but both describe the same DFA, and so contain precisely the same information. I'm perfectly happy for you to draw automata either way when it comes to exam questions or assessed coursework.

Exercise 17. Consider the following DFA. Which of its states are dump states, and which are unreachable? Draw the simplest automaton recognizing the same language.

[Figure: DFA diagram.]

Describe the language recognized by the automaton.

Exercise 18. Go through the automata you have drawn for Exercises 13 and 14. Identify any dump states in them.

3.5 Deterministic versus non-deterministic

So far we have found the following differences between deterministic and non-deterministic automata:

- For the same problem it is usually easier to design a non-deterministic automaton, and the resulting automata are often smaller.

- On the other hand, following a word through a deterministic automaton is straightforward, and so deciding whether the word is accepted is easy. For non-deterministic automata we have to find all the possible paths a word might move along, and decide whether any of them leads to an accepting state. Hence finding the language recognized by an NFA is usually harder than doing the same thing for a DFA of a similar size.

So both have advantages and disadvantages. But how different are they really? Clearly every deterministic automaton can be viewed as a non-deterministic one since it satisfies all the required criteria.

It therefore makes sense to wonder whether there are things we can do with non-deterministic automata that can't be done with deterministic ones. It turns out that this is not the case.

Theorem 3.1. For every non-deterministic automaton there is a deterministic one that recognizes precisely the same words.

Algorithm 1, example

Before looking at the general case of Algorithm 1 we consider an example. Consider the following NFA.

[Figure: NFA with states 0, 1 and 2.]

We want to construct a deterministic automaton from this step by step. We start with state 0, which we just copy, so it is both initial and an accepting state in our new automaton.

With [...], we can go from state 0 to states 1 and 2, so we invent a new state we call 12 (think of it as being a set containing both state 1 and state 2). Because 2 is an accepting state we make 12 an accepting state too.

With letter [...] we can go from state 0 to state 1, so we need a state 1 (think of it as state {1}) in the new automaton too.

Now we have to consider the states we have just created. In the original automaton from state 1, we can't go anywhere with [...], but with [...] we can go to state 2, so we introduce an accepting state 2 (thought of as {2}) into our new automaton.