COMPILATION OF AUTOMATA FROM MORPHOLOGICAL TWO-LEVEL RULES

Similar documents
A L A BA M A L A W R E V IE W

P a g e 5 1 of R e p o r t P B 4 / 0 9

OH BOY! Story. N a r r a t iv e a n d o bj e c t s th ea t e r Fo r a l l a g e s, fr o m th e a ge of 9

P a g e 3 6 of R e p o r t P B 4 / 0 9

Software Process Models there are many process model s in th e li t e ra t u re, s om e a r e prescriptions and some are descriptions you need to mode

T h e C S E T I P r o j e c t

176 5 t h Fl oo r. 337 P o ly me r Ma te ri al s

Use precise language and domain-specific vocabulary to inform about or explain the topic. CCSS.ELA-LITERACY.WHST D

Geometric Predicates P r og r a m s need t o t es t r ela t ive p os it ions of p oint s b a s ed on t heir coor d ina t es. S im p le exa m p les ( i

(C) Pavel Sedach and Prep101 1

Table of C on t en t s Global Campus 21 in N umbe r s R e g ional Capac it y D e v e lopme nt in E-L e ar ning Structure a n d C o m p o n en ts R ea

STANDARDIZATION OF BLENDED NECTAR USING BANANA PSEUDOSTEM SAP AND MANGO PULP SANTOSH VIJAYBHAI PATEL

c. What is the average rate of change of f on the interval [, ]? Answer: d. What is a local minimum value of f? Answer: 5 e. On what interval(s) is f

Last 4 Digits of USC ID:

I M P O R T A N T S A F E T Y I N S T R U C T I O N S W h e n u s i n g t h i s e l e c t r o n i c d e v i c e, b a s i c p r e c a u t i o n s s h o

Nucleus. Electron Cloud

Solutions and Ions. Pure Substances


9/20/2017. Elements are Pure Substances that cannot be broken down into simpler substances by chemical change (contain Only One Type of Atom)

Guide to the Extended Step-Pyramid Periodic Table

o Alphabet Recitation

Marks for each question are as indicated in [] brackets.

The Periodic Table of Elements

The distribution of characters, bi- and trigrams in the Uppsala 70 million words Swedish newspaper corpus

The Periodic Table. Periodic Properties. Can you explain this graph? Valence Electrons. Valence Electrons. Paramagnetism

H STO RY OF TH E SA NT

CHEM 251 (Fall-2003) Final Exam (100 pts)

02/05/09 Last 4 Digits of USC ID: Dr. Jessica Parr

I N A C O M P L E X W O R L D

Agenda Rationale for ETG S eek ing I d eas ETG fram ew ork and res u lts 2

PERIODIC TABLE OF THE ELEMENTS

Element Cube Project (x2)

Speed of light c = m/s. x n e a x d x = 1. 2 n+1 a n π a. He Li Ne Na Ar K Ni 58.

Beechwood Music Department Staff

CATAVASII LA NAȘTEREA DOMNULUI DUMNEZEU ȘI MÂNTUITORULUI NOSTRU, IISUS HRISTOS. CÂNTAREA I-A. Ήχος Πα. to os se e e na aș te e e slă ă ă vi i i i i

Chemistry 431 Practice Final Exam Fall Hours

Executive Committee and Officers ( )

THIS PAGE DECLASSIFIED IAW EO IRIS u blic Record. Key I fo mation. Ma n: AIR MATERIEL COMM ND. Adm ni trative Mar ings.

CHEM 10113, Quiz 5 October 26, 2011

Made the FIRST periodic table

C o r p o r a t e l i f e i n A n c i e n t I n d i a e x p r e s s e d i t s e l f

MANY ELECTRON ATOMS Chapter 15

CHEM 108 (Spring-2008) Exam. 3 (105 pts)

Dangote Flour Mills Plc

Faculty of Natural and Agricultural Sciences Chemistry Department. Semester Test 1. Analytical Chemistry CMY 283. Time: 120 min Marks: 100 Pages: 6

Atoms and the Periodic Table

Radiometric Dating (tap anywhere)

Faculty of Natural and Agricultural Sciences Chemistry Department. Semester Test 1 MEMO. Analytical Chemistry CMY 283

Chemistry 2 Exam Roane State Academic Festival. Name (print neatly) School

Use precise language and domain-specific vocabulary to inform about or explain the topic. CCSS.ELA-LITERACY.WHST D

Ch. 9 NOTES ~ Chemical Bonding NOTE: Vocabulary terms are in boldface and underlined. Supporting details are in italics.

(please print) (1) (18) H IIA IIIA IVA VA VIA VIIA He (2) (13) (14) (15) (16) (17)

If anything confuses you or is not clear, raise your hand and ask!

INSTRUCTIONS: Exam III. November 10, 1999 Lab Section

Lab Day and Time: Instructions. 1. Do not open the exam until you are told to start.

K E L LY T H O M P S O N

2 tel

8. Relax and do well.

CHEM 107 (Spring-2004) Exam 2 (100 pts)

Advanced Chemistry. Mrs. Klingaman. Chapter 5: Name:

M10/4/CHEMI/SPM/ENG/TZ2/XX+ CHEMISTRY. Wednesday 12 May 2010 (afternoon) 45 minutes INSTRUCTIONS TO CANDIDATES

Chapter 12 The Atom & Periodic Table- part 2

Lab Day and Time: Instructions. 1. Do not open the exam until you are told to start.

Lab Day and Time: Instructions. 1. Do not open the exam until you are told to start.

single-layer transition metal dichalcogenides MC2

Secondary Support Pack. be introduced to some of the different elements within the periodic table;

Using the Periodic Table

Chemistry 185 Exam #2 - A November 5, Lab Day and Time: Instructions. 1. Do not open the exam until you are told to start.

Topic 3: Periodicity OBJECTIVES FOR TODAY: Fall in love with the Periodic Table, Interpret trends in atomic radii, ionic radii, ionization energies &

CHEM 172 EXAMINATION 1. January 15, 2009

HANDOUT SET GENERAL CHEMISTRY II

CMSC 313 Lecture 17 Postulates & Theorems of Boolean Algebra Semiconductors CMOS Logic Gates

POLYTECHNIC OF NAMIBIA

7. Relax and do well.

Circle the letters only. NO ANSWERS in the Columns!

5 questions, 3 points each, 15 points total possible. 26 Fe Cu Ni Co Pd Ag Ru 101.

K. 27 Co. 28 Ni. 29 Cu Rb. 46 Pd. 45 Rh. 47 Ag Cs Ir. 78 Pt.

2 (27) 3 (26) 4 (21) 5 (18) 6 (8) Total (200) Periodic Table

Instructions. 1. Do not open the exam until you are told to start.

Metallurgical Chemistry. An Audio Course for Students

CHEM 130 Exp. 8: Molecular Models

Lab Day and Time: Instructions. 1. Do not open the exam until you are told to start. Page # Points possible Points awarded

1 of 5 14/10/ :21

CHEM Come to the PASS workshop with your mock exam complete. During the workshop you can work with other students to review your work.

Why all the repeating Why all the repeating Why all the repeating Why all the repeating

CHM 101 PRACTICE TEST 1 Page 1 of 4

8. Relax and do well.

F l a s h-b a s e d S S D s i n E n t e r p r i s e F l a s h-b a s e d S S D s ( S o-s ltiad t e D r i v e s ) a r e b e c o m i n g a n a t t r a c

CHEM 107 (Spring-2005) Exam 3 (100 pts)

EKOLOGIE EN SYSTEMATIEK. T h is p a p e r n o t to be c i t e d w ith o u t p r i o r r e f e r e n c e to th e a u th o r. PRIMARY PRODUCTIVITY.

8. Relax and do well.

Lesson Ten. What role does energy play in chemical reactions? Grade 8. Science. 90 minutes ENGLISH LANGUAGE ARTS

4.01 Elements, Symbols and Periodic Table

Building Harmony and Success

NAME: FIRST EXAMINATION

Atomic Structure & Interatomic Bonding

PART 1 Introduction to Theory of Solids

30 Zn(s) 45 Rh. Pd(s) Ag(s) Cd(s) In(s) Sn(s) white. 77 Ir. Pt(s) Au. Hg(l) Tl. 109 Mt. 111 Uuu. 112 Uub. 110 Uun. 65 Tb. 62 Sm. 64 Gd. 63 Eu.

DO NOW: Retrieve your projects. We will be reviewing them again today. Textbook pg 23, answer questions 1-3. Use the section 1.2 to help you.

CLASS TEST GRADE 11. PHYSICAL SCIENCES: CHEMISTRY Test 4: Matter and materials 1

Transcription:

Kimmo Koskenniemi Re se ar ch Unit for Co mp ut at io na l Li ng ui st ic s University of Helsinki, Hallituskatu 11 SF-00100 Helsinki, Finland COMPILATION OF AUTOMATA FROM MORPHOLOGICAL TWO-LEVEL RULES 1. INTRODUCTION T h e t w o - l e v e l m o d e l is a f r a m e w o r k for d e s c r i b i n g word inflection. The mo de l co ns is ts of a l e xicon system and a f o r m a l i s m for t w o - l e v e l rules. The l e xicon system defines all p o s s i b l e l e x i c a l r e p r e s e n t a t i o n s of w o r d - f o r m s w h e r e a s the r u l e s e x p r e s s the p e r m i t t e d r e l a t i o n s b e t w e e n l e x i c a l and surface representations. Word recognition is thus reduced into the question of finding a p e r m i s s i b l e le xi ca l repr es en ta ti on which is in a proper rela t i o n to the surface form. Similarly, ge ne ra ti on is the inverse where the l e xical representation is k n o w n and the ta s k is to f i nd a s u r f a c e r e p r e s e n t a t i o n w h i c h is in a proper relation to it. W i t h i n the t w o - l e v e l mo de l these r e lations pertaining to the r u l e c o m p o n e n t h a v e b e e n e x p r e s s e d in two ways. A r u l e f o r m a l i s m ha s b e e n us e d for c o m m u n i c a t i n g the idea of the r u l e s, w h e r e a s t h e a c t u a l i m p l e m e n t a t i o n s h a v e b e e n a c c o m p l i s h e d by h a n d - c o d i n g t h e r u l e s as f i n i t e s t a t e automata. The c l o s e c o nnection be tw ee n rules and finite state mach in es has fa ci li ta te d this hand-coding. Expressing rules as numbers in a transition matr ix is, of course, not optimal. A l t h o u g h it has p r o v e n to be feasible, it is tedious. It al s o tends to dist r a c t the linguist's thoughts f r o m m o r p h o p h o n o l o g i c a 1 v a r i a t i o n s to t e c h n i c a l m a t t e r s. F u r t h e r m o r e, h a n d c o m p i l e d a u t o m a t a are o f t e n no t q u i t e c o n s i s t e n t w i t h i n t e n d e d rules. D i s c r e p a n c i e s a r i s e b e c a u s e the design of rule au to ma ta is often affected by assumptions -143- Compilation of automata from morphological two-level rules Kimmo Koskenniemi, pages 143-149

on the r e g u l a r i t y of a c t u a l w o r d - f o r m s. Th us, s u c h a u t o m a t a u s u a l l y function c o r r e c t l y with most of the data but they ha v e a less cl ea r rela t i o n with orig i n a l intended rules. The rule comp i l e r de sc ri be d b e l o w rectifies this p r o b l e m by letting the ling uist write rules in a true rule f o r m a l i s m w h i l e the computer prod uces the automata m e c h anically. These can then be used in c o n v e n t i o n a l t w o - l e v e l programs, both for t e s t i n g a n d a n d f o r p r o d u c t i o n use. S e v e r a l t w o - l e v e l d e scriptions are now on their way towards c o m p l e t i o n using the c o m p i l e r. 2. THE FORMALISM OF TWO-LEVEL RULES T h e a c t u a l r u l e f o r m a l i s m s u p p o r t e d by the t w o - l e v e l c o m p i l e r d i f f e r s o n l y s l i g h t l y f r o m the o r i g i n a l f o r m a l i s m prop os ed in Koskenniemi (1983). One of the di ff er en ce s is the us e of l i n e a r r e p r e s e n t a t i o n for p a irs, t h u s a :o is u s e d i n s t e a d of F u r t h e r m o r e, the e q u a l s i g n is no l o n g e r u s e d for denoting the full lexical or surface alphabet. A surface v o w e l is w r i t t e n s i m p l y as :V an d a l e x i c a l a as a :. O t h e r basic elem e n t s are as they used to be; (1) A s e q u e n c e of e l e m e n t s a r e w r i t t e n o n e a f t e r a n o t h e r, t h u s :V :V s t a n d s for two s u c c e s s i v e surface vowels. (2) A l t e r n a t i v e e l e m e n t s are separated by a v e r t i c a l bar and e n c l o s e d in s q u a r e b r a c k e t s, e.g. [:i :j] st ands for either a su rf ac e i or a su rf ac e j. (3) Iteration is indicated with a superscript asterisk or p l u s sign, e.g. :C* s t a n d s for z e r o or m o r e surface c o nsonants whereas :C'* requires at least one surface consonant. R u l e s w i t h o p e r a t o r s => <= an d <=> e x i s t as b e f o r e and they are interpreted as before. A rule: I : j => :V :V -1^4-144

s t a t e s t h a t e v e r y o c c u r r e n c e of a p a i r I:j m u s t be in a c o n t e x t of.. :V :V.. i.e. b e t w e e n two s u r f a c e v o w e l s. A rule: I:j <= :V :V s t a t e s th a t b e t w e e n s u r f a c e v o w e l s a l e x i c a l I has no o t h e r p o s s i b l e r e al iz at io ns than a j. Ru le s with operators <=> are c o mbinations of those with <= and =>. There is one new type of rules with an operator >v<=. This r u l e f o r b i d s a n y o c c u r r e n c e s of LC C P RC, i.e. it f o r b i d s C P in the context LC RC. A n o t h e r d i f f e r e n c e is in the f o r m a l i s m for c o l l a p s i n g s e v e r a l s i m i l a r r u l e s i n to o n e rule. T h e i n i t i a l f o r m a l i s m used an g l e brackets, but this has been replaced by e q u i v a l e n t means using so c a l l e d "where" clauses. If W denotes a morphop h o n e m e for v o w e l d o u b l i n g t h e n a r u l e for v o w e l d o u b l i n g is expressed in the present f o r m a l i s m as: W : x <=> :x wh er e x in V; T h e d e f i n i t i o n of the r u l e c o m p o n e n t of t w o - l e v e l d e scriptions cons is ts of six sections: - a surface alph ab et as a list of surface characters - s u b s e t s of the s u r f a c e a l p h a b e t w h i c h ar e us e d in the rules - a lexical alphabet - subsets of the lexical alphabet - d e finitions for a b b r e v i a t i o n s or subexpressions used in. the rules - two-level rules. A sample two-level description is gi ve n below: -145-145

Lexical Alphabet a b c d e f g h i j k l m n o p q r s t u v w X y z S a o al a2 E I = W; Lexical Sets V = a e i o u y ^ a o a l a 2 E I W ; C = b c d f g h j k l m n p q r s t v w x z ; Di ac ri ti cs = / "; Surface Alphabet a b c d e f g h i j k l m n o p q r s t u v w X y X a a 5; Surface Sets V = a e i o u y ^ a o ; C = b c d f g h j k l m n p q r s t v w x z ; Definitions Defaults = al:a a2:a E:e I;i; Rules "Vowel doubling" W:X => :X ; where X in V; "Suppressed doubling" W:0 <=> I;; I: ; W ; V ; " Stem final V" [al:0 a2: o E:0 i;e] < = > I :; " Plural I" I:j <=> :V :V; N o t e t h a t s u r f a c e a n d l e x i c a l a l p h a b e t s a r e d e c l a r e d s e p a r a t e l y. T h i s g u a r a n t e e s t h a t the r o l e of e a c h s e g m e n t is u n i q u e l y determined. The sets are separate a l s o because it is no t a l w a y s e v i d e n t e.g. w h i c h s e g m e n t s a r e c o n s i d e r e d to be v o w e l s on the le xi ca l level. The first rule represents a set of s e v e r a l rules; W:a => :a ; W:e => :e ; W:6 => :6 ; i s i n t e r p r e t e d d u m m y v a r i a b l e X o c c u r s on b o t h s i d e s of the ru le. If it w o u l d -1A6-146

occur o n l y on the context side then the a b b r e v i a t i o n denotes o n e s i n g l e r u l e w i t h m a n y c o n t e x t p a r t s (one fo r e a c h possibility of X ). Another effect of the e x pansion of "where" c l a u s e s is the i n t r o d u c t i o n of s o m e n e w c h a r a c t e r pairs. P a i r s W:a, W:e, W:o do no t o c c u r a n y w h e r e in the d e s c r i p t i o n but t h e y are i m p l i c i t l y included. 3. STEPS OF THE COMPILATION The c o m p i l a t i o n of the t w o - l e v e l de s c r i p t i o n into finite s t a t e a u t o m a t a p r o c e e d s in s e v e r a l steps. T h e c o m p u t a t i o n relies e s s e n t i a l l y on Ron Kaplan's program pa ck ag es (FSM, FST) for m a n i p u l a t i n g f i n i t e s t a t e m a c h i n e s a n d t r a n s d u c e r s. T h e c o l l e c t i o n of r u l e s ha s to be t r e a t e d as a w h o l e b e c a u s e the set of c h a r a c t e r p a i r s (CPS) m i g h t be c h a n g e d if s o m e r u l e s ar e a l t e r e d t h us c h a n g i n g the i n t e r p r e t a t i o n of s o m e o t h e r rules. The steps of the c o m p i l a t i o n are: ( 1 ) ( 2 ) (3) (4) (5) T r a n s f o r m a t i o n of the t w o - l e v e l r u l e d e s c r i p t i o n i n t o a r e c u r s i v e l i s t e x p r e s s i o n w h e r e r u l e components and pairs are identified. Expa ns io n of "where" clauses in the rules. C o l l e c t i n g a l l pairs e x p l i c i t l y ment ioned in rules and d e f i n i t i o n s in a d d i t i o n to the d e f a u l t set of a l l x:x w h e r e x is b o t h a l e x i c a l and a s u r f a c e c h a r a c t e r. C o m p u t i n g the e x a c t i n t e r p r e t a t i o n of p a i r s w h i c h a r e not c o n c r e t e p a i r s (whe re b o t h the l e x i c a l an d the lexical comp onents are single characters). Some p a i r s l i k e :V l e a v e o n e l e v e l f u l l y o p e n a n d o t h e r s may use the defined subsets such as W:V (character W c o r r e s p o n d i n g to a n y of the v o w e l s b u t not to a z e r o 0). A l l s u c h ( p a r t i a l l y ) u n s p e c i f i e d p a i r s X:Y d e n o t e the s u b s e t of C P S c o n s i s t i n g of p a i r s x:y w h e r e x is X or is in X an d y is Y or is in Y. E x p a n d e a c h a b b r e v i a t i o n of the a b o v e t y p e in t o an alternation. Insert the defined e x pression in pl a c e of the name of the expression. -147-147

(6) C o m p i l e the c o mponents of the rules (correspondence parts, left cont exts and right contexts) into finite state machines. (7) Sp li t ru les with the operator <=> into one rule with operator => and another with <=. (8) E x p a n d r u l e s w i t h o p e r a t o r <= and w i t h m u l t i p l e cont ex ts into distinct rules with one context each. (9) Compile the individual comp on en t rules separately. (10) M e r g e the au to ma ta re su lt in g from a s i n g l e or ig in al rule into one rule au tomaton by in tersecting them. In the p r e s e n t v e r s i o n of the c o m p i l e r e a c h r u l e as d e f i n e d by the u s er is c o m p i l e d into a s i n g l e a u t o m a t o n. If the e x p a n s i o n or c o m p i l a t i o n s p l i t s the r u l e in t o s u b p a r t s t h e s e are f i n a l l y c o m b i n e d in t o a s i n g l e m a c h i n e by the c o m p i l e r. T h e c o m p i l a t i o n is d o n e on a X e r o x 1108 L i s p m a c h i n e w i t h p r o g r a m s w r i t t e n in I n t e r l i s p - D. T h e r e s u l t i n g a u t o m a t a c a n t h e n be us e d e i t h e r on the L i s p M a c h i n e or t r a n s p o r t e d to o t h e r sy st em s. In o r d e r to be u s e d by the p r e s e n t v e r s i o n of the Pascal t w o - l e v e l pr o g r a m the automata are c o n v e r t e d into a t a b u l a r f o r m a t w h i c h c a n be r e a d i l y used. T h e f o r m a t is s l i g h t l y di ff er en t from the o r iginal one g i v e n in Koskenniemi (1983) b u t it is s i g n i f i c a n t l y f a s t e r to r e a d in. S u c h automata ha ve been s u c c e s s f u l l y used on MS-DOS mi cr o comp uters s u c h as I B M PC a n d O l i v e t t i M 2 4. M a r t t i N y m a n a t t h e U n i v e r s i t y of H e l s i n k i working on one d e scription for M o d e r n Gr ee k and another for C l a s s i c a l Greek and Jo rm a L u utonen at the U n i v e r s i t y of T u r k u is w o r k i n g on o n e for C h e r e m i s. O l l i B l å b e r g has r e f o r m u l a t e d his Swedish d e s c ription in terms of the present compiler. T h e c o m p i l e r w a s w r i t t e n d u r i n g the s u m m e r 1 9 85 at the C e n t e r for S t u d i e s on L a n g u a g e a n d I n f o r m a t i o n at S t a n f o r d University. In ad di ti on to the finite state package written by Ron K a p l a n the c o m p i l e r ut il iz es Kaplan's concept of co mp il in g c o m p l e x rules with operator => and s e v e r a l context parts. The c o m p i l e r w a s p r e s e n t e d at a s y m p o s i u m o n f i n i t e s t a t e m o r p h o l o g y on July, 29-30 1985 at CSLI. The co mp il er has a l so s t i m u l a t e d some p a r a l l e l efforts (Bear, in press, Ritchie et. al. 1985, Kinnunen, in preparation). -148-148

REFERENCES Bear, John, (in press). A m o r p h o l o g i c a l r e c o g n i z e r w i t h syntactic and p h o n o l o g i c a l rules. P r oceedings of COLING- 86, Bonn. Kinnunen, Maarit, (in preparation). M o r f o l o g i s t e n saantojen kaan ta mi ne n a a r e l l i s i k s i automaateiksi. (The c o m p i l a t i o n of m o r p h o l o g i c a l r u l e s into f i n i t e s t a t e a u t o m a t a ) M a s t e r s t h e s i s. D e p a r t m e n t of C o m p u t e r S c i e n c e, U n i v e r s i t y of Helsinki. K o s k e n n i e m i, K. 1 9 8 3. T w o - l e v e l m o r p h o l o g y : A g e n e r a l c o m p u t a t i o n a l m o d e l f o r w o r d - f o r m r e c o g n i t i o n a n d production. Publications, No. 11, U n i v e r s i t y of Helsinki, De partment of Ge ne ra l Linguistics. Helsinki. ---, 1 9 8 4. A g e n e r a l c o m p u t a t i o n a l m o d e l for w o r d - f o r m recognition and production. P r o c eedings of COLING-84. pp. 178-181. R i t c h i e, G.D., S.G. P u l m a n, and G.J. R u s s e l, 1985. D i c t i o n a r y and m o r p h o l o g i c a l a n a l y z e r ( P r o t o t y p e ), U s e r guide: V e r s i o n 1.12. D e p a r t m e n t of A r t i f i c i a l I n t e l l i g e n c e, University of Edinburgh. - 1 ^ 9-149