Kimmo Koskenniemi Re se ar ch Unit for Co mp ut at io na l Li ng ui st ic s University of Helsinki, Hallituskatu 11 SF-00100 Helsinki, Finland COMPILATION OF AUTOMATA FROM MORPHOLOGICAL TWO-LEVEL RULES 1. INTRODUCTION T h e t w o - l e v e l m o d e l is a f r a m e w o r k for d e s c r i b i n g word inflection. The mo de l co ns is ts of a l e xicon system and a f o r m a l i s m for t w o - l e v e l rules. The l e xicon system defines all p o s s i b l e l e x i c a l r e p r e s e n t a t i o n s of w o r d - f o r m s w h e r e a s the r u l e s e x p r e s s the p e r m i t t e d r e l a t i o n s b e t w e e n l e x i c a l and surface representations. Word recognition is thus reduced into the question of finding a p e r m i s s i b l e le xi ca l repr es en ta ti on which is in a proper rela t i o n to the surface form. Similarly, ge ne ra ti on is the inverse where the l e xical representation is k n o w n and the ta s k is to f i nd a s u r f a c e r e p r e s e n t a t i o n w h i c h is in a proper relation to it. W i t h i n the t w o - l e v e l mo de l these r e lations pertaining to the r u l e c o m p o n e n t h a v e b e e n e x p r e s s e d in two ways. A r u l e f o r m a l i s m ha s b e e n us e d for c o m m u n i c a t i n g the idea of the r u l e s, w h e r e a s t h e a c t u a l i m p l e m e n t a t i o n s h a v e b e e n a c c o m p l i s h e d by h a n d - c o d i n g t h e r u l e s as f i n i t e s t a t e automata. The c l o s e c o nnection be tw ee n rules and finite state mach in es has fa ci li ta te d this hand-coding. Expressing rules as numbers in a transition matr ix is, of course, not optimal. A l t h o u g h it has p r o v e n to be feasible, it is tedious. It al s o tends to dist r a c t the linguist's thoughts f r o m m o r p h o p h o n o l o g i c a 1 v a r i a t i o n s to t e c h n i c a l m a t t e r s. F u r t h e r m o r e, h a n d c o m p i l e d a u t o m a t a are o f t e n no t q u i t e c o n s i s t e n t w i t h i n t e n d e d rules. D i s c r e p a n c i e s a r i s e b e c a u s e the design of rule au to ma ta is often affected by assumptions -143- Compilation of automata from morphological two-level rules Kimmo Koskenniemi, pages 143-149
on the r e g u l a r i t y of a c t u a l w o r d - f o r m s. Th us, s u c h a u t o m a t a u s u a l l y function c o r r e c t l y with most of the data but they ha v e a less cl ea r rela t i o n with orig i n a l intended rules. The rule comp i l e r de sc ri be d b e l o w rectifies this p r o b l e m by letting the ling uist write rules in a true rule f o r m a l i s m w h i l e the computer prod uces the automata m e c h anically. These can then be used in c o n v e n t i o n a l t w o - l e v e l programs, both for t e s t i n g a n d a n d f o r p r o d u c t i o n use. S e v e r a l t w o - l e v e l d e scriptions are now on their way towards c o m p l e t i o n using the c o m p i l e r. 2. THE FORMALISM OF TWO-LEVEL RULES T h e a c t u a l r u l e f o r m a l i s m s u p p o r t e d by the t w o - l e v e l c o m p i l e r d i f f e r s o n l y s l i g h t l y f r o m the o r i g i n a l f o r m a l i s m prop os ed in Koskenniemi (1983). One of the di ff er en ce s is the us e of l i n e a r r e p r e s e n t a t i o n for p a irs, t h u s a :o is u s e d i n s t e a d of F u r t h e r m o r e, the e q u a l s i g n is no l o n g e r u s e d for denoting the full lexical or surface alphabet. A surface v o w e l is w r i t t e n s i m p l y as :V an d a l e x i c a l a as a :. O t h e r basic elem e n t s are as they used to be; (1) A s e q u e n c e of e l e m e n t s a r e w r i t t e n o n e a f t e r a n o t h e r, t h u s :V :V s t a n d s for two s u c c e s s i v e surface vowels. (2) A l t e r n a t i v e e l e m e n t s are separated by a v e r t i c a l bar and e n c l o s e d in s q u a r e b r a c k e t s, e.g. [:i :j] st ands for either a su rf ac e i or a su rf ac e j. (3) Iteration is indicated with a superscript asterisk or p l u s sign, e.g. :C* s t a n d s for z e r o or m o r e surface c o nsonants whereas :C'* requires at least one surface consonant. R u l e s w i t h o p e r a t o r s => <= an d <=> e x i s t as b e f o r e and they are interpreted as before. A rule: I : j => :V :V -1^4-144
s t a t e s t h a t e v e r y o c c u r r e n c e of a p a i r I:j m u s t be in a c o n t e x t of.. :V :V.. i.e. b e t w e e n two s u r f a c e v o w e l s. A rule: I:j <= :V :V s t a t e s th a t b e t w e e n s u r f a c e v o w e l s a l e x i c a l I has no o t h e r p o s s i b l e r e al iz at io ns than a j. Ru le s with operators <=> are c o mbinations of those with <= and =>. There is one new type of rules with an operator >v<=. This r u l e f o r b i d s a n y o c c u r r e n c e s of LC C P RC, i.e. it f o r b i d s C P in the context LC RC. A n o t h e r d i f f e r e n c e is in the f o r m a l i s m for c o l l a p s i n g s e v e r a l s i m i l a r r u l e s i n to o n e rule. T h e i n i t i a l f o r m a l i s m used an g l e brackets, but this has been replaced by e q u i v a l e n t means using so c a l l e d "where" clauses. If W denotes a morphop h o n e m e for v o w e l d o u b l i n g t h e n a r u l e for v o w e l d o u b l i n g is expressed in the present f o r m a l i s m as: W : x <=> :x wh er e x in V; T h e d e f i n i t i o n of the r u l e c o m p o n e n t of t w o - l e v e l d e scriptions cons is ts of six sections: - a surface alph ab et as a list of surface characters - s u b s e t s of the s u r f a c e a l p h a b e t w h i c h ar e us e d in the rules - a lexical alphabet - subsets of the lexical alphabet - d e finitions for a b b r e v i a t i o n s or subexpressions used in. the rules - two-level rules. A sample two-level description is gi ve n below: -145-145
Lexical Alphabet a b c d e f g h i j k l m n o p q r s t u v w X y z S a o al a2 E I = W; Lexical Sets V = a e i o u y ^ a o a l a 2 E I W ; C = b c d f g h j k l m n p q r s t v w x z ; Di ac ri ti cs = / "; Surface Alphabet a b c d e f g h i j k l m n o p q r s t u v w X y X a a 5; Surface Sets V = a e i o u y ^ a o ; C = b c d f g h j k l m n p q r s t v w x z ; Definitions Defaults = al:a a2:a E:e I;i; Rules "Vowel doubling" W:X => :X ; where X in V; "Suppressed doubling" W:0 <=> I;; I: ; W ; V ; " Stem final V" [al:0 a2: o E:0 i;e] < = > I :; " Plural I" I:j <=> :V :V; N o t e t h a t s u r f a c e a n d l e x i c a l a l p h a b e t s a r e d e c l a r e d s e p a r a t e l y. T h i s g u a r a n t e e s t h a t the r o l e of e a c h s e g m e n t is u n i q u e l y determined. The sets are separate a l s o because it is no t a l w a y s e v i d e n t e.g. w h i c h s e g m e n t s a r e c o n s i d e r e d to be v o w e l s on the le xi ca l level. The first rule represents a set of s e v e r a l rules; W:a => :a ; W:e => :e ; W:6 => :6 ; i s i n t e r p r e t e d d u m m y v a r i a b l e X o c c u r s on b o t h s i d e s of the ru le. If it w o u l d -1A6-146
occur o n l y on the context side then the a b b r e v i a t i o n denotes o n e s i n g l e r u l e w i t h m a n y c o n t e x t p a r t s (one fo r e a c h possibility of X ). Another effect of the e x pansion of "where" c l a u s e s is the i n t r o d u c t i o n of s o m e n e w c h a r a c t e r pairs. P a i r s W:a, W:e, W:o do no t o c c u r a n y w h e r e in the d e s c r i p t i o n but t h e y are i m p l i c i t l y included. 3. STEPS OF THE COMPILATION The c o m p i l a t i o n of the t w o - l e v e l de s c r i p t i o n into finite s t a t e a u t o m a t a p r o c e e d s in s e v e r a l steps. T h e c o m p u t a t i o n relies e s s e n t i a l l y on Ron Kaplan's program pa ck ag es (FSM, FST) for m a n i p u l a t i n g f i n i t e s t a t e m a c h i n e s a n d t r a n s d u c e r s. T h e c o l l e c t i o n of r u l e s ha s to be t r e a t e d as a w h o l e b e c a u s e the set of c h a r a c t e r p a i r s (CPS) m i g h t be c h a n g e d if s o m e r u l e s ar e a l t e r e d t h us c h a n g i n g the i n t e r p r e t a t i o n of s o m e o t h e r rules. The steps of the c o m p i l a t i o n are: ( 1 ) ( 2 ) (3) (4) (5) T r a n s f o r m a t i o n of the t w o - l e v e l r u l e d e s c r i p t i o n i n t o a r e c u r s i v e l i s t e x p r e s s i o n w h e r e r u l e components and pairs are identified. Expa ns io n of "where" clauses in the rules. C o l l e c t i n g a l l pairs e x p l i c i t l y ment ioned in rules and d e f i n i t i o n s in a d d i t i o n to the d e f a u l t set of a l l x:x w h e r e x is b o t h a l e x i c a l and a s u r f a c e c h a r a c t e r. C o m p u t i n g the e x a c t i n t e r p r e t a t i o n of p a i r s w h i c h a r e not c o n c r e t e p a i r s (whe re b o t h the l e x i c a l an d the lexical comp onents are single characters). Some p a i r s l i k e :V l e a v e o n e l e v e l f u l l y o p e n a n d o t h e r s may use the defined subsets such as W:V (character W c o r r e s p o n d i n g to a n y of the v o w e l s b u t not to a z e r o 0). A l l s u c h ( p a r t i a l l y ) u n s p e c i f i e d p a i r s X:Y d e n o t e the s u b s e t of C P S c o n s i s t i n g of p a i r s x:y w h e r e x is X or is in X an d y is Y or is in Y. E x p a n d e a c h a b b r e v i a t i o n of the a b o v e t y p e in t o an alternation. Insert the defined e x pression in pl a c e of the name of the expression. -147-147
(6) C o m p i l e the c o mponents of the rules (correspondence parts, left cont exts and right contexts) into finite state machines. (7) Sp li t ru les with the operator <=> into one rule with operator => and another with <=. (8) E x p a n d r u l e s w i t h o p e r a t o r <= and w i t h m u l t i p l e cont ex ts into distinct rules with one context each. (9) Compile the individual comp on en t rules separately. (10) M e r g e the au to ma ta re su lt in g from a s i n g l e or ig in al rule into one rule au tomaton by in tersecting them. In the p r e s e n t v e r s i o n of the c o m p i l e r e a c h r u l e as d e f i n e d by the u s er is c o m p i l e d into a s i n g l e a u t o m a t o n. If the e x p a n s i o n or c o m p i l a t i o n s p l i t s the r u l e in t o s u b p a r t s t h e s e are f i n a l l y c o m b i n e d in t o a s i n g l e m a c h i n e by the c o m p i l e r. T h e c o m p i l a t i o n is d o n e on a X e r o x 1108 L i s p m a c h i n e w i t h p r o g r a m s w r i t t e n in I n t e r l i s p - D. T h e r e s u l t i n g a u t o m a t a c a n t h e n be us e d e i t h e r on the L i s p M a c h i n e or t r a n s p o r t e d to o t h e r sy st em s. In o r d e r to be u s e d by the p r e s e n t v e r s i o n of the Pascal t w o - l e v e l pr o g r a m the automata are c o n v e r t e d into a t a b u l a r f o r m a t w h i c h c a n be r e a d i l y used. T h e f o r m a t is s l i g h t l y di ff er en t from the o r iginal one g i v e n in Koskenniemi (1983) b u t it is s i g n i f i c a n t l y f a s t e r to r e a d in. S u c h automata ha ve been s u c c e s s f u l l y used on MS-DOS mi cr o comp uters s u c h as I B M PC a n d O l i v e t t i M 2 4. M a r t t i N y m a n a t t h e U n i v e r s i t y of H e l s i n k i working on one d e scription for M o d e r n Gr ee k and another for C l a s s i c a l Greek and Jo rm a L u utonen at the U n i v e r s i t y of T u r k u is w o r k i n g on o n e for C h e r e m i s. O l l i B l å b e r g has r e f o r m u l a t e d his Swedish d e s c ription in terms of the present compiler. T h e c o m p i l e r w a s w r i t t e n d u r i n g the s u m m e r 1 9 85 at the C e n t e r for S t u d i e s on L a n g u a g e a n d I n f o r m a t i o n at S t a n f o r d University. In ad di ti on to the finite state package written by Ron K a p l a n the c o m p i l e r ut il iz es Kaplan's concept of co mp il in g c o m p l e x rules with operator => and s e v e r a l context parts. The c o m p i l e r w a s p r e s e n t e d at a s y m p o s i u m o n f i n i t e s t a t e m o r p h o l o g y on July, 29-30 1985 at CSLI. The co mp il er has a l so s t i m u l a t e d some p a r a l l e l efforts (Bear, in press, Ritchie et. al. 1985, Kinnunen, in preparation). -148-148
REFERENCES Bear, John, (in press). A m o r p h o l o g i c a l r e c o g n i z e r w i t h syntactic and p h o n o l o g i c a l rules. P r oceedings of COLING- 86, Bonn. Kinnunen, Maarit, (in preparation). M o r f o l o g i s t e n saantojen kaan ta mi ne n a a r e l l i s i k s i automaateiksi. (The c o m p i l a t i o n of m o r p h o l o g i c a l r u l e s into f i n i t e s t a t e a u t o m a t a ) M a s t e r s t h e s i s. D e p a r t m e n t of C o m p u t e r S c i e n c e, U n i v e r s i t y of Helsinki. K o s k e n n i e m i, K. 1 9 8 3. T w o - l e v e l m o r p h o l o g y : A g e n e r a l c o m p u t a t i o n a l m o d e l f o r w o r d - f o r m r e c o g n i t i o n a n d production. Publications, No. 11, U n i v e r s i t y of Helsinki, De partment of Ge ne ra l Linguistics. Helsinki. ---, 1 9 8 4. A g e n e r a l c o m p u t a t i o n a l m o d e l for w o r d - f o r m recognition and production. P r o c eedings of COLING-84. pp. 178-181. R i t c h i e, G.D., S.G. P u l m a n, and G.J. R u s s e l, 1985. D i c t i o n a r y and m o r p h o l o g i c a l a n a l y z e r ( P r o t o t y p e ), U s e r guide: V e r s i o n 1.12. D e p a r t m e n t of A r t i f i c i a l I n t e l l i g e n c e, University of Edinburgh. - 1 ^ 9-149