Analogical Dissimilarity: definition, algorithms and first experiments in machine learning

Laurent Miclet, Arnaud Delhay

To cite this version: Laurent Miclet, Arnaud Delhay. Analogical Dissimilarity: definition, algorithms and first experiments in machine learning. [Research Report] RR-5694, INRIA. 2005, pp.60. <inria-00070321>

HAL Id: inria-00070321
https://hal.inria.fr/inria-00070321
Submitted on 19 May 2006

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE
Research report No. 5694, September 2005. Theme COG. ISSN 0249-6399. ISRN INRIA/RR--5694--FR+ENG

Theme COG: Cognitive systems. Project CORDIAL. Research report No. 5694, September 2005, 60 pages.

Abstract: This paper defines the notion of analogical dissimilarity between four objects, with a special focus on dissimilarity between objects structured as sequences. Firstly, it studies the case where the four objects have a null analogical dissimilarity, i.e. are in an analogical relation. Secondly, when one of these objects is unknown, it gives algorithms to compute it. In particular, it studies a new formulation of solving analogical equations on sequences, based on the edit distance between strings. Thirdly, it tackles the problem of defining analogical dissimilarity, which is a measure of how close four objects are to being in an analogical relation. To finish, it gives learning algorithms, i.e. methods to find the triple of objects in a learning sample which has the least analogical dissimilarity with a given object.

Key-words: Analogy, Sequences, Machine Learning, Analogical Dissimilarity, Learning by analogy

{Laurent.Miclet,Arnaud.Delhay}@univ-rennes1.fr
Unité de recherche INRIA Rennes, IRISA, Campus universitaire de Beaulieu, 35042 Rennes Cedex (France). Telephone: +33 2 99 84 71 00. Fax: +33 2 99 84 71 71


Contents

1 Introduction.
  1.1 Reasoning and learning by analogy
    1.1.1 Analogical relation between four objects
    1.1.2 Solving analogical equations
    1.1.3 Learning by analogy
  1.2 Situation of the paper and bibliography
  1.3 Organization of the paper
2 Analogy in finite sets.
  2.1 The axioms of analogy
  2.2 Definition of a distance coherent with analogy
  2.3 Analogy in sets
    2.3.1 Defining finite sets by binary features
    2.3.2 Defining finite alphabets as cyclic groups
  2.4 Analogy in the vector space $\mathbb{R}^n$
    2.4.1 An analogical relation in $\mathbb{R}^n$
    2.4.2 The transitivity of analogy in vector spaces
    2.4.3 A set of coherent distances
3 Analogy between sequences.
  3.1 Notations
  3.2 A first definition
  3.3 A second definition using analogy in alphabets
    3.3.1 Motivation
    3.3.2 Analogy between sequences based on alignments
    3.3.3 Connection between the two definitions
4 Solving analogical equations in sets.
  4.1 Analogical equations
  4.2 Solving analogical equations in finite sets defined by binary features
  4.3 Solving analogical equations in finite groups
  4.4 Solving analogical equations in $\mathbb{R}^n$
5 Solving analogical equations in sequences.
  5.1 Solving analogical equations in sequences: an algebraic method
    5.1.1 Shuffle
    5.1.2 Complementary subsequences and complementary sets
  5.2 Solving analogical equations in sequences using the edit distance
    5.2.1 The edit distance between sequences
    5.2.2 Edit distance and analogy
    5.2.3 Resolving analogical equations using the edit distance: first method
    5.2.4 Resolving analogical equations using the edit distance: second method
    5.2.5 The two algorithms are equivalent
    5.2.6 The case of multiple optimal solutions and the compared complexity of the two algorithms
  5.3 The transitivity of analogy in sequences
  5.4 Our algorithms are more general than the algebraic method
6 Analogical dissimilarity between sets.
  6.1 Motivation
  6.2 A definition in finite sets defined by binary features
    6.2.1 Definition
    6.2.2 Example
    6.2.3 Properties
  6.3 A definition in $\mathbb{R}^n$
    6.3.1 Definition
    6.3.2 Properties
  6.4 A definition in a cyclic group
  6.5 Comments on defining an analogical dissimilarity in a metric space
7 Analogical dissimilarity between sequences.
  7.1 A first definition
    7.1.1 Approximate analogical dissimilarity
    7.1.2 An example
  7.2 A better definition
    7.2.1 Analogical dissimilarity between sequences
    7.2.2 Properties
    7.2.3 Algorithm
  7.3 The two dissimilarities are not the same
  7.4 Experiments on the analogical dissimilarity between sequences
    7.4.1 Constructing the set S
    7.4.2 Running the experiment
    7.4.3 Conclusion
8 Analogical dissimilarity and machine learning.
  8.1 Motivation
  8.2 The brute force solution
  8.3 Fast nearest neighbor search: the AESA algorithm
    8.3.1 The principle
    8.3.2 Elimination
    8.3.3 Selection
    8.3.4 Reducing the precomputation: LAESA
  8.4 "FADANA": FAst search of the least Dissimilar ANAlogy
    8.4.1 Preliminary computation
    8.4.2 Principle of the algorithm
    8.4.3 Notations
    8.4.4 Initialization
    8.4.5 Selection
    8.4.6 Elimination
  8.5 Selection of base prototypes in FADANA
  8.6 Experiments
9 Conclusion and future work.

1 Introduction.

1.1 Reasoning and learning by analogy.

Analogy is a way of reasoning which has been studied throughout the history of philosophy and has been widely used in Artificial Intelligence ([28]) and Linguistics. Lepage ([19]) has given an extensive description of the history of this concept and its applications in science and linguistics.

1.1.1 Analogical relation between four objects.

An analogy or analogical relation between four objects A, B, C and D in the same universe is usually expressed as follows: A is to B as C is to D. Depending on what the objects are, analogies can have very different meanings. For example, natural language analogies could be: "a crow is to a raven as a merlin is to a peregrine" or "vinegar is to wine as a sloe is to a cherry". These analogies are based on the semantics of the words. By contrast, in the formal universe of sequences, analogies such as "abcd is to abc as abbd is to abb" or "g is to gt as gg is to ggt" are morphological.

Whether morphological or not, the examples above show the intrinsic ambiguity in defining an analogy. We could as well accept, for other good reasons: "g is to gt as gg is to ggtt" or "vinegar is to wine as cheese is to milk". Obviously, such ambiguities are inherent in semantic analogies, since they are related to the meaning of words (the concepts are expressed through natural language). Hence, it seems important, as a first step, to focus on formal morphological properties. Moreover, resolving syntactic analogies in sequences is an operational problem in several fields of linguistics, such as morphology and syntax, and provides a basis for learning and data mining by analogy in the universe of sequences.

Several formal definitions of the analogical relations "is to" and "as" will be given in this article, but the basic idea can be grasped through examples on sequences of letters: the analogy "began is to begun as ran is to run" holds true, from the linguistic point of view as well as from a pure "sequence of letters" or "morphological" point of view. From this second point of view only, "xan is to xun as yzan is to yzun" is also true, while "began is to begun as move is to moved" is true only from the first point of view.

In this paper, we will firstly consider analogies in sets of letters and secondly how they may be transferred to morphological analogies between sequences of letters.

1.1.2 Solving analogical equations.

When one of the four elements is unknown, an analogical relation turns into an equation. For instance, on sequences of letters: "wolf is to leaf as wolves is to x". Resolving this equation consists in computing the (possibly empty) set of sequences x which satisfy the analogy. The sequence "leaves" is an obvious solution, both linguistically and morphologically. We shall see, however, that it is not straightforward to design an algorithm able to solve this kind of equation.

Solving analogical equations on sequences is useful for linguistic analysis tasks and has been applied (with empirical resolution techniques, or in simple cases) mainly to lexical analysis tasks. For example, Yvon ([34], [35]) presents an analogical approach to grapheme-to-phoneme conversion, for text-to-speech synthesis purposes. More generally, the resolution of analogical equations can also be seen as a basic component of learning by analogy systems, which are part of the lazy learning techniques [9].

1.1.3 Learning by analogy.

Let S be a set of training examples $S = \{(x, u(x))\}$, where x is the description of an example (x may be a sequence or a vector in $\mathbb{R}^d$, for instance) and u(x) its label in a finite set. Given the description y of a new pattern, we would like to assign to y a label u(y), based only on the knowledge of S. This is the problem of learning a classification rule from examples ([25]), which consists in finding the value of u at point y. The nearest neighbor method, which is the most popular lazy learning technique, simply finds in S the description $x^\star$ which minimizes some distance to y and hypothesizes $u(x^\star)$, the label of $x^\star$, for the label of y.

Moving one step further, analogical learning searches in S for a triple $(x^\star, z^\star, t^\star)$ such that "$x^\star$ is to $z^\star$ as $t^\star$ is to y" and predicts for y the label $\hat{u}(y)$ which is the solution of the equation "$u(x^\star)$ is to $u(z^\star)$ as $u(t^\star)$ is to $\hat{u}(y)$". If more than one triple is found, a voting procedure can be used. Such a learning technique is based on the resolution of analogical equations. [26] discusses at length the relevance of such a learning procedure for various linguistic analysis tasks. It is important to notice that y and u(y) are in different domains: for example, in the simple case of learning a classification rule, y may be a sequence and u(y) is merely a class label.

A further step in learning by analogy is to search S for a triple $(x^\star, z^\star, t^\star)$ such that "$x^\star$ is to $z^\star$ as $t^\star$ is to y" holds almost true, or, when a closeness measure is defined, for the triple which is the closest to y in terms of analogical relation. We study in this article how to quantify this measure, in order to provide a more flexible method of learning by analogy.
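The analogical learning procedure of section 1.1.3 can be sketched in a few lines. This is only an illustration under stated assumptions: the descriptions are integers, the relation `in_analogy` (equal differences) is a toy stand-in for the definitions given later in the report, and `solve_label_equation` solves the label equation with the determinism property alone. All function names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def in_analogy(x, z, t, y):
    """Toy relation (an assumption): x is to z as t is to y iff z - x == y - t."""
    return z - x == y - t


def solve_label_equation(ux, uz, ut):
    """Labels u solving 'ux is to uz as ut is to u', using determinism only."""
    solutions = set()
    if ux == uz:  # A : A :: B : u gives u = B
        solutions.add(ut)
    if ux == ut:  # A : B :: A : u gives u = B
        solutions.add(uz)
    return solutions


def classify_by_analogy(sample, y):
    """Vote over every ordered triple of distinct examples in analogy with y."""
    votes = Counter()
    for (x, ux), (z, uz), (t, ut) in permutations(sample, 3):
        if in_analogy(x, z, t, y):
            votes.update(solve_label_equation(ux, uz, ut))
    return votes.most_common(1)[0][0] if votes else None


sample = [(1, "odd"), (2, "even"), (3, "odd"), (4, "even"), (5, "odd")]
print(classify_by_analogy(sample, 6))
```

On this sample the new pattern 6 receives the label "even": every triple in analogy with 6 whose label equation is solvable votes for it. The loop is the brute-force search over all triples, cubic in the size of S.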

1.2 Situation of the paper and bibliography.

This paper is related to several domains of artificial intelligence. Obviously, the first one is that of reasoning by analogy. Much work has been done on this subject from a cognitive science point of view, which has led to computational models of reasoning by analogy (see the book [13] and, for example, the classical paper [12]). Usually, these works use the notion of transfer, which is not within the scope of this article. It means that some knowledge on solving a problem in a domain is transported to another domain. Since we work on four objects that are in the same space X, we implicitly ignore the notion of transfer between different domains. Technically speaking, this restriction allows us to use an axiom called "exchange of the means" to define what an analogy is (see Definition 2.1). For example, if we work on strings, the four strings have to be written on the same alphabet. However, what we share with these works is the following idea: there may be a similar relation between two couples of structured objects even if the objects are apparently quite different. We are interested in giving a formal and algorithmic definition of what such a relation can be.

Our work also aims at defining some supervised machine learning process ([25], [7]), in the spirit of lazy learning ([2]). It means that we do not seek to extract a model from the learning data, but merely conclude what is the class, or more generally the supervision, of a new object by inspecting (a part of) the learning data. Usually, lazy learning, like the k-nearest neighbors technique, makes use of unstructured objects, such as vectors. Since distance measures can also be defined on strings, trees and even graphs, this technique has also been used on structured objects, in the framework of structural pattern recognition (see for example [5, 4, 3]). We extend here the search of the nearest neighbor in the learning set to that of the best triple (the one which, when combined with the new object, is the closest to making an analogy). This requires defining what an analogy is on structured objects, like sequences, but also giving a definition of how far a four-tuple of objects is from being in analogy (which we call analogical dissimilarity).

Learning by analogy on sequences has already been studied, in a more restricted manner, on linguistic data ([34, 36, 16, 15], etc.). Reasoning and learning by analogy have proven useful in tasks like grapheme-to-phoneme conversion, morphology and even translation. Sequences of letters and/or of phonemes are a natural application of our work, but we are also interested in the future in other types of data, structured as sequences or trees, like prosodic representations for speech synthesis, biochemical sequences, etc.

Analogical relations between four structured objects of the same universe, mainly strings, have been studied with a mathematical and algorithmic approach, like ours, by Mitchell and Hofstadter ([24, 14]), Dastani et al. ([10]), Schmid et al. ([31]), Yvon et al. ([38]). To the best of our knowledge, the use of the edit distance as a method of comparison between sequences is original in the framework of analogy, as is our formal definition of what can be an analogical dissimilarity between four objects, a fortiori between sequences.

To connect with another field of A.I., let us quote A. Aamodt and E. Plaza ([1]) about the use of the term "analogy" in Case-Based Reasoning (CBR): "Analogy-based reasoning: This term is sometimes used, as a synonym to case-based reasoning, to describe the typical case-based approach. However, it is also often used to characterize methods that solve new problems based on past cases from a different domain, while typical case-based methods focus on indexing and matching strategies for single-domain cases." CBR is close to reasoning by analogy, since it focusses on the transfer between different domains. Again, we are interested here in formal analogies between four objects of the same domain.

1.3 Organization of the paper.

This paper is organized in eight sections. After this introduction, we present in section 2 the general principles which govern the definition of an analogy between four objects in the same universe. This is applied firstly to three sets of objects: objects defined by binary features, finite cyclic groups and the vector space $\mathbb{R}^n$. Section 3 describes what an analogy between four sequences is, firstly according to a definition by Lepage, and Yvon and Stroppa, secondly giving an original extension to sequences of letters on which analogies are defined, and using the edit distance for the "is to" relation.

Sections 4 and 5 study the resolution of analogical equations, in sets and in sequences. Concerning sequences, we recall the algebraic method by Yvon and we present a definition and two different algorithms to compute the set of solutions with a more general concept of analogy. We show that both algorithms (called SEQUANA1 and SEQUANA2) produce the set of all the solutions according to our definition.

Sections 6 and 7 introduce the new concept of analogical dissimilarity (AD) between four objects, by measuring in some way how much these objects are in analogy. In particular, it must be equivalent to say that four objects are in analogy and that their analogical dissimilarity is null. We define this concept in the three sorts of sets that we use as alphabets and we extend it to sequences. We give an algorithm to compute the value of AD between four sequences, which is based on the same idea as SEQUANA2. We also show that an algorithm based on SEQUANA1 cannot produce the same result. To finish, we give a few experiments to measure how AD can cope with noisy data.
Section 8 begins to explore the use of the concept of analogical dissimilarity in supervised machine learning. We extend the AESA algorithm of fast search of the nearest neighbor to that of the fast search of the best analogical triplet in the learning set. This algorithm, called FADANA, is tested on artificial data to measure its efficiency. The last section is a conclusion and presents work to be done, particularly in presenting some possible real world applications of learning by analogy in the universe of sequences.

2 Analogy in finite sets.

2.1 The axioms of analogy.

There is no general definition of an analogical relation "A is to B as C is to D" between four objects in a set X, the "is to" and the "as" relations depending on the nature of X. However, according to the usual meaning of the word analogy in philosophy and linguistics, three basic axioms are generally required ([19]):

Definition 2.1 (Analogy.) An analogy on a set X is a relation on $X^4$, i.e. a subset $\mathcal{A} \subset X^4$. When $(A, B, C, D) \in \mathcal{A}$, the four elements A, B, C and D are said to be in analogy, and we write: "the analogical relation A : B :: C : D holds true", or simply A : B :: C : D, which reads "A is to B as C is to D". For every four-tuple in analogy, we have the following properties:

Symmetry of the "as" relation: C : D :: A : B
Exchange of the means: A : C :: B : D

A third axiom (determinism) requires that one of the two following implications holds true (the other being a consequence):

$A : A :: B : x \Rightarrow x = B$
$A : B :: A : x \Rightarrow x = B$

According to these axioms, five other formulations are proven to be equivalent to A : B :: C : D:

B : A :: D : C, D : B :: C : A, C : A :: D : B, D : C :: B : A and B : D :: A : C

A consequence of the first two axioms is that there are only three different possible analogies between four objects, with the canonical forms:

A : B :: C : D, A : C :: D : B, A : D :: B : C

2.2 Definition of a distance coherent with analogy.

X is a metric set if there exists a distance δ on X, according to the following classical definition.

Definition 2.2 (Distance on a set X.) A distance δ on a set X is a mapping of $X \times X$ on $\mathbb{R}$, with the following properties:

Reflexivity. $\forall x \in X,\ \delta(x, x) = 0$
Strict positiveness. $\forall x, y \in X,\ x \neq y \Rightarrow \delta(x, y) > 0$
Symmetry. $\forall x, y \in X,\ \delta(x, y) = \delta(y, x)$
Triangle inequality. $\forall x, y, z \in X,\ \delta(x, y) \leq \delta(x, z) + \delta(z, y)$

If there exists an analogy on a metric space X, a relation between the distance and the analogy on X can be defined. Its usefulness will appear later in this paper, especially when defining analogies on sequences (sections 3 and 5) and analogical dissimilarities (sections 6 and 7).
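Returning to the axioms of Definition 2.1, the counts stated there (eight equivalent formulations in total, three canonical forms) can be checked mechanically by closing a four-tuple under the first two axioms. The sketch below is illustrative only; the function names are hypothetical.

```python
from itertools import permutations


def equivalent_forms(a, b, c, d):
    """Close (a, b, c, d) under symmetry of 'as' and exchange of the means."""
    forms = {(a, b, c, d)}
    frontier = [(a, b, c, d)]
    while frontier:
        p, q, r, s = frontier.pop()
        for new in ((r, s, p, q),   # symmetry: p : q :: r : s gives r : s :: p : q
                    (p, r, q, s)):  # exchange of the means: gives p : r :: q : s
            if new not in forms:
                forms.add(new)
                frontier.append(new)
    return forms


def analogy_classes(objects):
    """Group all orderings of four distinct objects into equivalence classes."""
    classes, seen = [], set()
    for perm in permutations(objects):
        if perm not in seen:
            cls = equivalent_forms(*perm)
            seen |= cls
            classes.append(cls)
    return classes


forms = equivalent_forms("A", "B", "C", "D")
print(len(forms), len(analogy_classes("ABCD")))
```

The closure contains the original form, the two forms given by the axioms and the five other formulations listed above. The 24 orderings of four distinct objects split into three classes, one per canonical form.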

Figure 1: Analogy and coherent distance in $\mathbb{R}^n$. When the analogy A : B :: C : D stands true, the four elements form a parallelogram and the Euclidean distance $\delta_2$ has the properties $\delta_2(A, B) = \delta_2(C, D)$ and $\delta_2(A, C) = \delta_2(B, D)$.

Definition 2.3 (Distance coherent with analogy.) A distance δ is said to be coherent with analogy if, for every four-tuple A, B, C and D in the analogical relation, δ has the properties:

$A : B :: C : D \Rightarrow \delta(A, B) = \delta(C, D)$ and $\delta(A, C) = \delta(B, D)$

It is clearly the case, for example, if $X = \mathbb{R}^n$, four elements being in analogy if they form a parallelogram (see Figure 1), and if δ is the Euclidean distance. This example will be developed in section 2.4.

2.3 Analogy in sets.

2.3.1 Defining finite sets by binary features.

Let X be a finite set or alphabet, composed of elements that we will call objects. We assume that there exists a set F, with cardinal n, of binary features such that every object $x \in X$ can be defined by a binary vector $(f_1(x), \ldots, f_n(x))$. For every x and every $i \in [1, n]$, $f_i(x) = 1$ (resp. $f_i(x) = 0$) means that the binary feature takes the value TRUE or 1 (resp. FALSE or 0) on the object x. We call such a set X a finite set defined by binary features. Equivalently, an object $x \in X$ can be seen as a subset of F, composed of the elements $f_i$ such that $f_i(x) = 1$. Therefore, studying what an analogy is between four objects in an alphabet defined by binary features is equivalent to studying what an analogy is between four sets.
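The coherence condition of Definition 2.3 can be checked numerically on the parallelogram example. The sketch below assumes that the fourth vertex is obtained as D = B + C - A, which is one way of reading Figure 1. The function names and the tolerance are illustrative choices.

```python
import math


def fourth_vertex(a, b, c):
    """Assumed construction: D = B + C - A completes the parallelogram."""
    return tuple(bi + ci - ai for ai, bi, ci in zip(a, b, c))


def is_coherent(a, b, c, d, dist=math.dist, tol=1e-9):
    """Check delta(A, B) = delta(C, D) and delta(A, C) = delta(B, D)."""
    return (abs(dist(a, b) - dist(c, d)) <= tol
            and abs(dist(a, c) - dist(b, d)) <= tol)


a, b, c = (0.0, 0.0), (1.0, 2.0), (3.0, 1.0)
d = fourth_vertex(a, b, c)           # (4.0, 3.0)
print(is_coherent(a, b, c, d))       # the parallelogram satisfies both equalities
print(is_coherent(a, b, c, (5.0, 3.0)))
```

Any other distance can be passed through the `dist` parameter to test whether it is coherent on a given four-tuple.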

A first definition. When the "as" relation is the equality between sets, Lepage has given a definition of an analogical relation between sets coherent with the axioms.

Definition 2.4 (Analogy between sets.) Four sets A, B, C and D are in analogy A : B :: C : D if and only if A can be transformed into B and C into D by adding and subtracting the same elements to A and C.

This is the case, for example, of the four sets $A = \{t_1, t_2, t_3, t_4\}$, $B = \{t_1, t_2, t_3, t_5\}$ and $C = \{t_1, t_4, t_6, t_7\}$, $D = \{t_1, t_5, t_6, t_7\}$, where $t_4$ has been taken off from, and $t_5$ has been added to, A and C, giving B and D. With this definition, Lepage ([18]) has shown a double necessary condition of inclusion for four sets to be in analogical relation:

$A \subseteq B \cup C$ and $A \supseteq B \cap C$   (2.1)

In section 4.2 we will see how, under this condition, a unique solution D can be given to the equation A : B :: C : x, with respect to the axioms of analogy:

$x = ((B \cup C) \setminus A) \cup (B \cap C)$

A second equivalent definition. Stroppa and Yvon have given another definition of the analogy between four sets, which proves to be equivalent to that of Lepage ([32]).

Definition 2.5 (Analogy between sets.) Four sets A, B, C and D are in analogy A : B :: C : D if and only if there exist four sets X, Y, Z and T such that:

$A = X \cup Y$, $B = X \cup Z$, $C = T \cup Y$, $D = T \cup Z$

We have given in [23] the complete proof that the two definitions are equivalent. The sketch is straightforward: the inclusion conditions of equation 2.1 imply that, among the 16 disjoint sets created by the intersection of A, B, C and D, only 5 are non empty. They can be combined by union either according to the first definition or to the second.

The transitivity of analogy in sets. In sets, the analogy has the property of transitivity:

Property 2.1 (Transitivity of analogy in sets.) Let A, B, C, D, E and F be six sets. Then the following implication holds:

$(A : B :: C : D \text{ and } C : D :: E : F) \Rightarrow A : B :: E : F$
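Condition 2.1, the solution formula and Definition 2.5 can be exercised on the example sets above with Python's built-in set operations. This is a minimal sketch: integers stand for the elements $t_i$, and the particular choice of X, Y, Z, T as pairwise intersections is an assumption made for illustration, not necessarily the construction used in [23].

```python
def solve_set_equation(a, b, c):
    """Solve A : B :: C : x on sets; None when condition (2.1) fails."""
    a, b, c = frozenset(a), frozenset(b), frozenset(c)
    if not (b & c <= a <= b | c):      # B ∩ C ⊆ A ⊆ B ∪ C
        return None
    return ((b | c) - a) | (b & c)     # x = ((B ∪ C) \ A) ∪ (B ∩ C)


def decomposition(a, b, c, d):
    """Try X = A∩B, Y = A∩C, Z = B∩D, T = C∩D as witnesses for Definition 2.5."""
    a, b, c, d = map(frozenset, (a, b, c, d))
    x, y, z, t = a & b, a & c, b & d, c & d
    ok = a == x | y and b == x | z and c == t | y and d == t | z
    return (x, y, z, t) if ok else None


A, B, C = {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 4, 6, 7}
D = solve_set_equation(A, B, C)
print(sorted(D))                       # the set D of the example: t1, t5, t6, t7
print(decomposition(A, B, C, D) is not None)
```

When condition 2.1 does not hold, as for A = {1}, B = {2}, C = {3}, the solver returns None rather than a set.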

A distance in sets defined by binary features coherent with analogy. Let X be a set defined by binary features. We can see an element A of X either as the set of features that are TRUE on A or as a binary vector of size n, where n is the total number of features (the cardinal of the set F). In the first case, we define the distance δ(A, B) between two elements A and B of X as the cardinal of the symmetrical difference between the two sets. In the second case, δ(A, B) is the Hamming distance between the two binary vectors. Obviously, the two definitions are equivalent. We have the following property:

Property 2.2 (Coherence of the symmetrical difference.) Let δ(A, B) be the cardinal of the symmetrical difference between the two sets A and B. Then δ is coherent with the analogy on sets.

The proof is straightforward, from the definition of analogy on X.

2.3.2 Defining finite alphabets as cyclic groups.

In this section, we define an analogy and a coherent distance on finite cyclic groups. We start from a finite set with an inner operator and we examine what properties are requested in connection with the analogy. This construction is sufficient to eventually ensure that every analogical equation has a unique solution (this point is developed in section 4.3).

Let $(X, \oplus)$ be a set with an inner operator $\oplus$ and an analogy. Let a, b, c and d be four elements of X. We connect the operator $\oplus$ to the analogy on X by requiring the following property:

$(a \oplus d = b \oplus c) \Leftrightarrow (a : b :: c : d)$

Properties of the operator according to the analogy. We have given in Definition 2.1 the axioms of analogy as described by Lepage. From each axiom, we deduce an algebraic property for the operator $\oplus$.

Symmetry. $a : b :: c : d \Leftrightarrow c : d :: a : b$, that is: $a \oplus d = b \oplus c \Leftrightarrow c \oplus b = d \oplus a$

Exchange of the means. $a : b :: c : d \Leftrightarrow a : c :: b : d$, that is: $a \oplus d = b \oplus c \Leftrightarrow a \oplus d = c \oplus b$. From this, we conclude that the operator $\oplus$ must be commutative, since $b \oplus c = c \oplus b$ if a : b :: c : d.

Determinism. $a : a :: c : x \Rightarrow x = c$ and $a : b :: a : x \Rightarrow x = b$. It can be expressed with $\oplus$ by: $a \oplus x = a \oplus c \Rightarrow x = c$ and $a \oplus x = b \oplus a \Rightarrow x = b$. The first equation expresses the property of left regularity. Because of the commutativity, we can state that $\oplus$ must be regular.
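The three requirements on the operator can be tested exhaustively on a small alphabet. The sketch below assumes that letters are encoded as integers 0..N-1 and takes addition modulo N as a candidate operator. Multiplication modulo N is included only as a commutative but non-regular counter-example. The size N = 7 and all names are illustrative assumptions.

```python
from itertools import product

N = 7  # hypothetical alphabet size


def add_mod(x, y):
    return (x + y) % N


def mul_mod(x, y):
    return (x * y) % N


def check_axioms(op):
    """Test symmetry, exchange of the means and determinism for the analogy
    defined by: a : b :: c : d  iff  op(a, d) == op(b, c)."""
    def in_analogy(a, b, c, d):
        return op(a, d) == op(b, c)

    elems = range(N)
    for a, b, c, d in product(elems, repeat=4):
        if in_analogy(a, b, c, d):
            if not (in_analogy(c, d, a, b) and in_analogy(a, c, b, d)):
                return False
    for a, b, x in product(elems, repeat=3):
        if in_analogy(a, a, b, x) and x != b:   # a : a :: b : x must give x = b
            return False
        if in_analogy(a, b, a, x) and x != b:   # a : b :: a : x must give x = b
            return False
    return True


print(check_axioms(add_mod), check_axioms(mul_mod))
```

Addition modulo N passes all three checks. Multiplication modulo N fails determinism because 0 is not regular: 0 : 0 :: 1 : x is satisfied by every x.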
Uniqueness of the solution. We nticipte here on section 4.3 to go long with our construction. To solve n nlogicl eqution : :: c : x is to find every element x which verifies this reltion. In the cse of finite lphets with n opertor defined s ove, we cn mnge so tht every nlogicl eqution hs unique solution. For this purpose, INRIA

Anlogicl Dissimilrity 13 we need to consider specil element of X, nmely, to e the neutrl element of (X, ), i.e. =. In this cse, the nlogicl eqution : :: : x, which cn lso e expressed s : x = = would hve unique solution if every element in X hs unique symmetric. Assuming tht X is group [27] is sufficient to get this properties. Moreover, this group is elin since is commuttive. A sufficient mnner to give the finl construction of this group is to tke it s n dditive cyclic group. The tle of the opertor is given on group of size 7 in Figure 2. c d e f c d e f c d e f c d e f c c d e f d d e f c e e f c d f f c d e Figure 2: A tle for n nlogicl opertor on n lphet of 6 elements plus, seen s the dditive cyclic group G 7. A distnce in lphets defined s finite groups coherent with nlogy. We wnt here to uild distnce δ on the lphet which is coherent with the nlogy. Tht is, for qudruple (x, y, z, t) in nlogy, we wnt to hve the equlity : ( (x : y :: z : t) (x t = y z) ) ( δ(x, y) = δ(z, t) ). For exmple, we know from Figure 2, considering the element c, tht δ is such tht: c = c = δ(c, ) = δ(, ) c = c = f d δ(c, f) = δ(d, ) c = c = e e δ(c, e) = δ(e, ) c = = f d δ(, f) = δ(d, ) c = = e e δ(, e) = δ(e, ) c = f d = e e δ(f, e) = δ(e, d) Considering ll such equtions deduced from nlogicl equtions given y the nlogy tle, we cn deduce constrints on the distnce δ. In tht wy we cn show (see [22]) tht the distnce tle hs only n 2 different vlues nd hs circulnt structure. The tle in Figure 3 represents distnce if the vlues of the vriles (α, β, γ) verify the tringle inequlity (positivity, symmetry nd identity re verified if the vlues re positive). We hve wy to construct such distnce tle y using the geometricl representtion of finite cyclic group in R 2 : we plce the letters regulrly on circle (see Figure 4) nd RR n 5694

14 Lurent Miclet, Arnud Delhy δ c d e f 0 α β γ γ β α α 0 α β γ γ β β α 0 α β γ γ c γ β α 0 α β γ d γ γ β α 0 α β e β γ γ β α 0 α f α β γ γ β α 0 Figure 3: The tle of the distnce δ on the finite group G 7. we define the distnce etween letters s the euclidin distnce in R 2. This is sufficient to construct distnce, since every triple of elements is tringle in R 2. c α d γ β f e Figure 4: Representing G 7, the dditive cyclic finite group with 7 elements, nd defining distnce on G 7. 2.4 Anlogy in the vector spce R n. We hve shown in the previous sections how to uild nlogies nd coherent distnces in two different finite lphets, the first one eing defined y inry fetures, the second eing defined s finite group. We re interested now in the cse where X is the vector spce R n. An nlogy etween four ojects, or vectors, in R n is usully (see [32]) informlly defined s follows:,, c, nd d re four vectors in nlogicl reltion if nd only if they construct prllelogrm in R n. INRIA
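The cyclic-group construction of section 2.3.2 (Figures 2 to 4) can be sketched as follows. This is an illustration under stated assumptions: the character `"~"` stands for the neutral element, the names `op`, `delta` and `is_coherent` are hypothetical, and the circle is taken with radius 1, a choice the text does not fix.

```python
import math

LETTERS = ["~", "a", "b", "c", "d", "e", "f"]  # "~" stands for the neutral element
N = len(LETTERS)
INDEX = {s: i for i, s in enumerate(LETTERS)}


def op(x, y):
    """The operator of Figure 2: addition of indices modulo 7."""
    return LETTERS[(INDEX[x] + INDEX[y]) % N]


def delta(x, y):
    """Chord length between two letters placed regularly on the unit circle."""
    k = (INDEX[x] - INDEX[y]) % N
    return 2.0 * math.sin(math.pi * k / N)


def is_coherent():
    """Check delta(x, y) = delta(z, t) whenever x + t = y + z in the group."""
    for x in LETTERS:
        for y in LETTERS:
            for z in LETTERS:
                for t in LETTERS:
                    if op(x, t) == op(y, z):
                        if not math.isclose(delta(x, y), delta(z, t), abs_tol=1e-12):
                            return False
    return True


print(op("a", "c"))   # d
print(is_coherent())  # True
```

Because the chord length depends only on the index difference modulo 7, the resulting table is circulant and takes three nonzero values, which play the roles of $\alpha$, $\beta$ and $\gamma$ in Figure 3.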

2.4.1 An analogical relation in $\mathbb{R}^n$.

Let $O$ be the origin of the vector space. Let $a = (a_1, a_2, \ldots, a_n)$ be a vector of $\mathbb{R}^n$, as defined by its $n$ coordinates. Let $a$, $b$, $c$ and $d$ be four vectors of $\mathbb{R}^n$. The interpretation of an analogy $a : b :: c : d$ is usually that $a$, $b$, $c$, $d$ are the summits of a parallelogram, $a$ and $d$ being opposite summits. In the form of an analogical equation, written in the vectorial manner:

Definition 2.6 (Analogy in $\mathbb{R}^n$.) Four vectors of $\mathbb{R}^n$ are in the analogy $a : b :: c : d$ if and only if they form a parallelogram:

\[ \overrightarrow{Oa} - \overrightarrow{Ob} = \overrightarrow{Oc} - \overrightarrow{Od} \]

which can also be written $\overrightarrow{ab} = \overrightarrow{cd}$ or equivalently $\overrightarrow{ac} = \overrightarrow{bd}$.

It is straightforward that the axioms of analogy, given in section 2.1, derive from this definition.

2.4.2 The transitivity of analogy in vector spaces.

In vector spaces, the analogy also has the property of transitivity:

Property 2.3 (Transitivity of analogy in $\mathbb{R}^n$.) Let $a$, $b$, $c$, $d$, $e$ and $f$ be six vectors of $\mathbb{R}^n$. Then the following implication holds:

\[ (a : b :: c : d \ \text{and} \ c : d :: e : f) \Rightarrow a : b :: e : f \]

The proof comes from definition 2.6.

2.4.3 A set of coherent distances.

We recall that a distance $\delta$ is said to be coherent with analogy if, for every four-uple $a$, $b$, $c$ and $d$ which is in the analogical relation $a : b :: c : d$, the distance $\delta$ has the properties:

\[ \delta(a, b) = \delta(c, d) \quad \text{and} \quad \delta(a, c) = \delta(b, d) \]

In $\mathbb{R}^n$, any distance $\delta_p$ defined from the norm $\|\cdot\|_p$,

\[ \delta_p(a, b) = \|a - b\|_p = \Big( \sum_{i=1}^{n} |a_i - b_i|^p \Big)^{1/p} \]

is coherent with the analogy defined by $\overrightarrow{ab} = \overrightarrow{cd}$. This is directly proven from a classical property of euclidian spaces: for every distance $\delta_p$ in $\mathbb{R}^n$, $\overrightarrow{ab} = \overrightarrow{cd}$ implies $\delta_p(a, b) = \delta_p(c, d)$.
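A short numerical sketch may make the parallelogram definition and the coherence of $\delta_p$ concrete. The function names and the sample vectors are illustrative assumptions; the fourth vector is obtained by rearranging Definition 2.6 into $d = b + c - a$.

```python
import math


def solve_vector_analogy(a, b, c):
    """Fourth summit of the parallelogram: d = b + c - a, coordinate by coordinate."""
    return [bi + ci - ai for ai, bi, ci in zip(a, b, c)]


def delta_p(u, v, p=2.0):
    """Distance derived from the p-norm of u - v."""
    return sum(abs(ui - vi) ** p for ui, vi in zip(u, v)) ** (1.0 / p)


a = [0.0, 0.0, 1.0]
b = [1.0, 2.0, 1.0]
c = [3.0, -1.0, 0.0]
d = solve_vector_analogy(a, b, c)
print(d)  # [4.0, 1.0, 0.0]

# Coherence: both pairs of opposite sides have equal length for several p.
for p in (1.0, 2.0, 3.0):
    assert math.isclose(delta_p(a, b, p), delta_p(c, d, p))
    assert math.isclose(delta_p(a, c, p), delta_p(b, d, p))
```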

3 Analogy between sequences.

In this section, we present two different manners to define an analogical relation between four sequences of objects. After having given some classical notations about sequences, we will first give a definition by Yvon, which refines and consolidates that by Lepage. Then we present a definition of ours, that we show to be more general. The objects which build the sequences can be elements of a finite alphabet (for our purpose, either defined by binary features or being a finite group) or vectors of a euclidian space.

3.1 Notations.

A sequence (more classically in language theory, a word) is a finite series of symbols from an alphabet $\Sigma$. $\Sigma^*$ is the set of all words. For $x$, $y$ in $\Sigma^*$, $xy$ is the concatenation of $x$ and $y$. We also denote $|x|$ the length of $x$, and $x = x_1 \ldots x_{|x|}$ or $x = x[1] \ldots x[n]$, with $x_i$ or $x[i] \in \Sigma$ and $n = |x|$. We denote $\epsilon$ the empty word, of null length, and $\Sigma^+ = \Sigma^* \setminus \{\epsilon\}$. Finally, we denote $L(x)$ the subset of $\Sigma$ in which are taken the letters of the word $x$ and $\bar{L}(x)$ the subset of $\Sigma$ composed of the letters that do not appear in $x$.

A factor (or subword) $f$ of a sequence $x$ is a sequence in $\Sigma^*$ such that there exist two sequences $u$ and $v$ in $\Sigma^*$ with $x = ufv$. For example, $abb$, $bbac$ and $acbb$ are factors of $abbacbbaba$.

A subsequence of a sequence $x = x_1 \ldots x_{|x|}$ is composed of the letters of $x$ with the indices $i_1 \ldots i_k$, such that $i_1 < i_2 < \ldots < i_k$. For example, $ca$ and $aaa$ are two subsequences of $abbacbbaba$.

3.2 A first definition.

Yvon ([37]) gives the following definition of analogy between sequences:

Definition 3.1 (Analogy between sequences, first definition.) $(x, y, z, t) \in \Sigma^+$ are in analogical relation, noted $x : y :: z : t$, if and only if $\exists n > 0$, $\exists \alpha_i, i \in [1, n]$, $\beta_i, i \in [1, n] \in \Sigma^*$ such that, either:

\[ x = \alpha_1 \ldots \alpha_n, \quad t = \beta_1 \ldots \beta_n, \quad y = \alpha_1 \beta_2 \alpha_3 \ldots \alpha_n, \quad z = \beta_1 \alpha_2 \beta_3 \ldots \beta_n \]

or

\[ x = \alpha_1 \ldots \alpha_n, \quad t = \beta_1 \ldots \beta_n, \quad y = \beta_1 \alpha_2 \beta_3 \ldots \alpha_n, \quad z = \alpha_1 \beta_2 \alpha_3 \ldots \beta_n \]

and $\forall i$, $\alpha_i \beta_i \neq \epsilon$. The smallest integer $n$ for which this property holds is called the degree of the relation.

For instance, reception : refection :: deceptive : defective is an analogy between sequences, with $n = 3$ and the factors: $\alpha_1 = re$, $\alpha_2 = cept$, $\alpha_3 = ion$, $\beta_1 = de$, $\beta_2 = fect$, $\beta_3 = ive$. We could also have chosen the following factors: $\alpha_1 = r$, $\alpha_2 = ecept$, $\alpha_3 = ion$ and accordingly $\beta_1 = d$, $\beta_2 = efect$ and $\beta_3 = ive$.

The degree of an analogical relation can be seen as a measure of its complexity: the smaller the degree, the better the analogy. This matches the intuition that good analogies should preserve large portions of the original words, the trivial analogy being the one that involves identical words. Given this definition, the following properties hold (see also [17]):

\[
\begin{array}{lr}
\forall x \in \Sigma^+, \ x : x :: x : x & (3.1) \\
\forall x, y \in \Sigma^+, \ x : x :: y : y & (3.2) \\
\forall x, y, z, t \in \Sigma^+, \ x : y :: z : t \Rightarrow z : t :: x : y & (3.3) \\
\forall x, y, z, t \in \Sigma^+, \ x : y :: z : t \Rightarrow x : z :: y : t & (3.4)
\end{array}
\]

which proves that this definition of analogy is consistent with the axioms given in section 2.1. Lepage and Yvon have proven that the two following conditions are necessary for the sequences $x$, $y$, $z$ and $t$ to be in analogy:

Symbol inclusion: $L(x) \cup L(t) = L(y) \cup L(z)$ (3.5)

Similarity: $|x| + |t| = |y| + |z|$.

3.3 A second definition using analogy in alphabets.

3.3.1 Motivation.

The definition of analogy between strings given in the previous section is quite strict, in the sense that the fourth term is constructed with letters that have appeared in the three other terms. Moreover, letters are considered as independent objects. In particular, if there is some analogical relation on the alphabet, it cannot be transmitted to sequences. We study now how to consider this possibility.

For example, assume that we have the alphabet $\Sigma = \{a, b, \alpha, \beta, A, B, C\}$ in which there exist the analogical relations $a : b :: A : B$, $a : \alpha :: b : \beta$ and $A : \alpha :: B : \beta$, and that the following analogical equation is proposed on $\Sigma^*$:

\[ aaBAB : \alpha\alpha bab :: bbBAB : x \]

We certainly would conclude, by examining letter by letter, that $x = \beta\beta bab$. Such a solution cannot be obtained in the framework given in the last section, since the letter $\beta$ appears nowhere in the first three terms of the equation. Therefore, we would like to extend the definition of analogy between sequences to such cases. We also want to accept analogies on sequences with no constraints on their lengths.

This is why we have previously studied analogies on sets, which will be used as the alphabets of the sequences.
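As a companion to Definition 3.1, the following sketch builds the four words from a list of factors (first alternation pattern only) and tests the two necessary conditions of section 3.2. The function names are hypothetical, and $L(\cdot)$ is read here as a plain set of letters, as in section 3.1.

```python
def build_analogy(alphas, betas):
    """Build (x, y, z, t) from factors, following the first pattern of Definition 3.1."""
    if len(alphas) != len(betas):
        raise ValueError("alphas and betas must have the same length")
    if any(a + b == "" for a, b in zip(alphas, betas)):
        raise ValueError("alpha_i beta_i must not be the empty word")
    pairs = list(zip(alphas, betas))
    x = "".join(alphas)
    t = "".join(betas)
    # y = alpha_1 beta_2 alpha_3 ..., z = beta_1 alpha_2 beta_3 ...
    y = "".join(a if i % 2 == 0 else b for i, (a, b) in enumerate(pairs))
    z = "".join(b if i % 2 == 0 else a for i, (a, b) in enumerate(pairs))
    return x, y, z, t


def necessary_conditions(x, y, z, t):
    """Symbol inclusion (3.5) and similarity; necessary, not sufficient."""
    symbol_inclusion = (set(x) | set(t)) == (set(y) | set(z))
    similarity = len(x) + len(t) == len(y) + len(z)
    return symbol_inclusion and similarity


quad = build_analogy(["re", "cept", "ion"], ["de", "fect", "ive"])
print(quad)                         # ('reception', 'refection', 'deceptive', 'defective')
print(necessary_conditions(*quad))  # True
```

Both factorizations quoted in section 3.2 yield the same four words, which illustrates why the degree is defined as the smallest admissible $n$ rather than through a particular choice of factors.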

3.3.2 Analogy between sequences based on alignments.

We give here a more general definition of analogy between four sequences and show that it satisfies the axioms given in section 2.1. Let $\Sigma$ be an alphabet. We add a new letter to $\Sigma$, that we denote $\smile$, giving the augmented alphabet $\Sigma'$. The interpretation of this new letter is simply that of an "empty" symbol, that we will need when computing edit distances between sequences in subsequent sections.

Definition 3.2 (Semantic equivalence.) Let $x$ be a sequence of $\Sigma^*$ and $y$ a sequence of $\Sigma'^*$. $x$ and $y$ are semantically equivalent if the subsequence of $y$ composed of letters of $\Sigma$ is $x$. We denote this relation by $\equiv$. For example, $ab{\smile}a{\smile}a \equiv abaa$.

Let us assume that there is an analogy in $\Sigma'$, i.e. that for every 4-uple $a$, $b$, $c$, $d$ of letters of $\Sigma'$, the relation $a : b :: c : d$ is defined as being either TRUE or FALSE. Let $\delta$ be a distance on $\Sigma'$, coherent with the analogy.

Definition 3.3 (Alignment between sequences.) An alignment between two sequences $x, y \in \Sigma^*$, of lengths $m$ and $n$, is a word $z$ on the alphabet $(\Sigma' \times \Sigma') \setminus \{(\smile, \smile)\}$ whose first projection is semantically equivalent to $x$ and whose second projection is semantically equivalent to $y$.

Informally, an alignment represents a one-to-one letter matching between the two sequences, in which some letters $\smile$ may be inserted. The matching $(\smile, \smile)$ is not permitted. An alignment can be presented as an array of two rows, one for $x$ and one for $y$, each word completed with some $\smile$, resulting in two words of $\Sigma'^*$ having the same length. For instance, here is an alignment between $x = abgef$ and $y = acde$:

\[
\begin{array}{ccccccc}
x' = & a & b & \smile & g & e & f \\
y' = & a & c & d & \smile & e & \smile
\end{array}
\]

We can define in the same way an alignment between more sequences. The following definition uses alignments between four sequences.

Definition 3.4 (Analogy between sequences, second definition.) Let $u$, $v$, $w$ and $x$ be four sequences on $\Sigma^*$, on which an analogy is defined. We say that $u$, $v$, $w$ and $x$ are in analogy in $\Sigma^*$ if there exist four sequences $u'$, $v'$, $w'$ and $x'$ of the same length $n$ in $\Sigma'^*$, with the following properties:

1. $u' \equiv u$, $v' \equiv v$, $w' \equiv w$ and $x' \equiv x$.
2. $\forall i \in [1, n]$ the analogies $u'_i : v'_i :: w'_i : x'_i$ hold true in $\Sigma'$.
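Definition 3.4 asks for the existence of an alignment; checking a proposed alignment is the easy half and can be sketched as below. Everything here is an illustrative assumption: `"~"` stands for the empty symbol, the letter analogy is taken to hold for the trivial patterns $x : x :: y : y$ and $x : y :: x : y$ plus a user-supplied set of 4-uples closed under symmetry and exchange of the means, and the sample words are chosen for the demonstration only.

```python
GAP = "~"  # stands for the empty symbol of the augmented alphabet


def close_under_axioms(quads):
    """Close a set of letter analogies under symmetry and exchange of the means."""
    closed = set(quads)
    frontier = list(closed)
    while frontier:
        a, b, c, d = frontier.pop()
        for q in ((c, d, a, b), (a, c, b, d)):
            if q not in closed:
                closed.add(q)
                frontier.append(q)
    return closed


def letter_analogy(a, b, c, d, known):
    """Trivial analogies always hold; the others must be listed in `known`."""
    if (a == b and c == d) or (a == c and b == d):
        return True
    return (a, b, c, d) in known


def is_aligned_analogy(words, rows, known):
    """Check that `rows` is a witness of Definition 3.4 for the four `words`."""
    if len({len(r) for r in rows}) != 1:
        return False
    # Semantic equivalence: removing the gaps from each row gives back the word.
    if any(r.replace(GAP, "") != w for r, w in zip(rows, words)):
        return False
    for column in zip(*rows):
        if all(s == GAP for s in column):
            return False
        if not letter_analogy(*column, known):
            return False
    return True


known = close_under_axioms({("a", "b", "A", "B"), ("a", "α", "b", "β"), ("A", "α", "B", "β")})
words = ("aBA", "αbBA", "ba", "βba")
rows = ("a~BA", "αbBA", "b~a~", "βba~")
print(is_aligned_analogy(words, rows, known))  # True
```

Searching for a witness alignment, rather than verifying one, is a harder problem that this sketch does not attempt.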
We prove here that this definition verifies the axioms of analogy. These axioms hold true for each 4-uple $u'_i$, $v'_i$, $w'_i$ and $x'_i$, i.e.:

1. $w'_i : x'_i :: u'_i : v'_i$
2. $u'_i : w'_i :: v'_i : x'_i$
3. $u'_i = v'_i \Rightarrow w'_i = x'_i$

Therefore, by concatenation of the $n$ terms, we have in $\Sigma'^*$:

1. $w' : x' :: u' : v'$
2. $u' : w' :: v' : x'$
3. $u' = v' \Rightarrow w' = x'$

And, by semantic equivalence, we have in $\Sigma^*$:

1. $w : x :: u : v$
2. $u : w :: v : x$
3. $u = v \Rightarrow w = x$

which ensures that the axioms of analogy are verified for definition 3.4.

For example, let $\Sigma' = \{a, b, \alpha, \beta, A, B, C, \smile\}$ with the analogies $a : b :: A : B$, $a : \alpha :: b : \beta$ and $A : \alpha :: B : \beta$. The following alignment between the four sequences $aBA$, $\alpha bBA$, $ba$ and $\beta ba$ is an analogy on $\Sigma'^*$:

\[
\begin{array}{cccc}
a & \smile & B & A \\
\alpha & b & B & A \\
b & \smile & a & \smile \\
\beta & b & a & \smile
\end{array}
\]

3.3.3 Connection between the two definitions.

We establish here the following property:

Property 3.1 The analogy between sequences based on alignments that we have given in Definition 3.4 is strictly more general than that defined by Yvon and Stroppa (Definition 3.1).

The demonstration consists in redefining an analogy by Yvon and Stroppa in terms of alignments. It is easy to see that the alignments corresponding to their definition are a subset of all the possible alignments, since every column would have one of the two following particular forms, for $a, b \in \Sigma'$:

\[
\begin{pmatrix} a \\ a \\ b \\ b \end{pmatrix}
\quad \text{or} \quad
\begin{pmatrix} a \\ b \\ a \\ b \end{pmatrix}
\]

4 Solving analogical equations in sets.

4.1 Analogical equations.

To solve an analogical equation consists in finding the fourth term of an analogical relation, the first three being known.

Definition 4.1 (Analogical equation.) $t$ is a solution of the analogical equation $x : y :: z : ?$ if and only if $x : y :: z : t$.

We already know from previous sections that, depending on the nature of the objects and the definition of analogy, an analogical equation may have no solution, a unique solution (this is the only case where the analogy is transitive, see section 5.3) or several solutions. We study in the sequel how to solve analogical equations in the different sets that we have introduced. Then we give two definitions of the solving of analogical equations in sequences, the second being original, and we show that it is more general than the first one.

4.2 Solving analogical equations in finite sets defined by binary features.

Considering analogy in sets, Lepage ([18]) has shown the following theorem, with respect to the axioms of analogy (section 2.1):

Theorem 4.2 (Solution of an analogical equation in sets.) Let $A$, $B$ and $C$ be three sets. The analogical equation $A : B :: C : D$, where $D$ is the unknown, has a solution if and only if the following conditions hold true:

\[ A \subseteq B \cup C \quad \text{and} \quad A \supseteq B \cap C \]

The solution is then unique, given by:

\[ D = ((B \cup C) \setminus A) \cup (B \cap C) \]

If we assume that $A$, $B$ and $C$ are subsets of a set $P$, the solution can also be written, denoting $\bar{A}$ the complement of $A$ in $P$, as the union of three disjoint sets:

\[ D = (B \cap \bar{A}) \cup (C \cap \bar{A}) \cup (A \cap B \cap C) \]

This theorem also applies to the resolution of analogical equations in a set $X$ defined by binary features. Recall that for each $x \in X$ and each $i \in [1, n]$, $f_i(x) = 1$ (resp. $f_i(x) = 0$) means that the binary feature $f_i$ takes the value TRUE (resp. FALSE) on the object $x$. Let $A : B :: C : D$ be an analogical equation where $D$ is the unknown. For each feature $f_i$, there are only eight different possibilities of values on $A$, $B$ and $C$. From the theorem above we can derive the manner of computing $D$, with the two following principles:

Each feature $f_i(D)$ can be computed independently.

The following table gives the solution $f_i(D)$:

\[
\begin{array}{c|cccccccc}
f_i(A) & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
f_i(B) & 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\
f_i(C) & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ \hline
f_i(D) & 0 & 1 & 1 & ? & ? & 0 & 0 & 1
\end{array}
\]

In two cases among the eight, $f_i(D)$ does not exist. This derives from the definition of $X$ by binary features, which is equivalent to defining $X$ as a set of features. Theorem 4.2 imposes conditions on the resolution of analogical equations on finite sets, with the result that two binary analogical equations have no solution.

4.3 Solving analogical equations in finite groups.

We have constructed alphabets as finite groups in section 2.3.2 with the explicit purpose that every analogical equation has one and only one solution. We show on an example how this solution is computed. Let $G_7$ be the group defined by the operator $\oplus$ given in Figure 2 and the corresponding distance (Figure 3). Recall that four elements in a finite group are in analogy when

\[ a \oplus x = b \oplus c \Leftrightarrow a : b :: c : x \]

Let us take as an example the resolution of the analogical equation $e : a :: c : x$. It consists in looking up in the table of Figure 2 the value of $a \oplus c$, which gives $d$, and in searching in the same table for the unique element $x$ such that $a \oplus c = e \oplus x = d$, which gives $x = f$.

4.4 Solving analogical equations in $\mathbb{R}^n$.

Solving the analogical equation $u : v :: w : x$, where $u$, $v$ and $w$ are vectors of $\mathbb{R}^n$ and $x$ is the unknown, derives directly from the definition of analogy in vector spaces: the four vectors must form a parallelogram. There is always one and only one solution, given by the equation:

\[ \overrightarrow{Ox} = \overrightarrow{Ov} + \overrightarrow{Ow} - \overrightarrow{Ou} \]
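The solvers of sections 4.2 and 4.3 are small enough to be written out. The sketch below is illustrative only: `solve_binary` applies the feature table column by column and returns `None` in the two cases without solution, while `solve_cyclic` reads $G_7$ as addition of indices modulo 7, with `"~"` standing for the neutral element. Both names are hypothetical.

```python
def solve_binary(a, b, c):
    """Solve A : B :: C : D on binary feature vectors; None if some feature has no solution."""
    d = []
    for fa, fb, fc in zip(a, b, c):
        # The two unsolvable columns of the table are (0, 1, 1) and (1, 0, 0).
        if fb == fc and fa != fb:
            return None
        d.append(fb + fc - fa)
    return d


LETTERS = "~abcdef"  # "~" stands for the neutral element of G7


def solve_cyclic(a, b, c):
    """Unique x with a + x = b + c in the additive cyclic group of Figure 2."""
    n = len(LETTERS)
    return LETTERS[(LETTERS.index(b) + LETTERS.index(c) - LETTERS.index(a)) % n]


print(solve_binary([0, 0, 1, 1], [0, 1, 0, 1], [1, 0, 1, 1]))  # [1, 1, 0, 1]
print(solve_binary([0], [1], [1]))                             # None
print(solve_cyclic("e", "a", "c"))                             # f
```

The vector case of section 4.4 follows the same arithmetic, $x = v + w - u$, applied to real coordinates without any solvability test.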

5 Solving analogical equations in sequences.

5.1 Solving analogical equations in sequences: an algebraic method.

Solving an analogical equation consists in computing the fourth term of an analogical relation, given the three others. We consider in this section the first definition of analogy in sequences as defined by Yvon and Lepage (see section 3.2). In this framework, not all analogical equations have a solution: for instance, the equation $abc : def :: ijk : ?$ does not have any solution. We have given in section 3.2 a couple of necessary conditions for an analogical equation to hold true, namely symbol inclusion and similarity. Symbol inclusion requires that $x$, $y$, $z$ and $t$ can only be in analogy when all symbols in $x$ occur either in $y$ or in $z$. Then $t$ contains precisely those symbols in $y$ and $z$ that are not found in $x$. Similarity requires that all the solutions of an analogical equation have the same length.

Conversely, some analogical equations may have more than one solution. For instance, $bc : abc :: c : ?$ has two equally acceptable solutions: $ac$ and $ca$. Yvon and Stroppa have shown that the set of solutions of an analogical relation on words, according to definition 3.1, can be expressed in terms of two basic constructions on words and languages, the shuffle and the complementary set constructions.

5.1.1 Shuffle.

The notion of shuffle is introduced (e.g. in [29]) as follows.

Definition 5.1 (Shuffle.) If $u$ and $v$ are two words in $\Sigma^*$, their shuffle is the language defined as:

\[ u \bullet v = \{ u_1 v_1 u_2 v_2 \ldots u_n v_n, \ \text{with} \ u_i, v_i \in \Sigma^*, \ u = u_1 \ldots u_n, \ v = v_1 \ldots v_n \} \]

Informally, the shuffle of two words $u$ and $v$ contains all the words $w$ which can be composed using all the symbols in $u$ and $v$, with the constraint that if a symbol $a$ precedes $b$ in $u$ (or in $v$), then it must precede $b$ in $w$. For instance, if $u = abc$ and $v = def$, the words $abcdef$, $abdefc$, $adbecf$ and many others are in $u \bullet v$; this is not the case with $abefcd$, in which $d$ occurs after, rather than before, $e$.

The shuffle operation has the following basic properties:

\[
\begin{array}{lr}
u \bullet \epsilon = \{u\} \quad (\epsilon \text{ is neutral}) & (5.1) \\
u \bullet v = v \bullet u \quad (\text{commutativity}) & (5.2) \\
(u \bullet v) \bullet w = u \bullet (v \bullet w) \quad (\text{associativity}) & (5.3) \\
u (v \bullet w) \subseteq (uv) \bullet w & (5.4)
\end{array}
\]

The shuffle is generalized to languages according to the following definition:

\[ K \bullet L = \bigcup_{u \in K, \, v \in L} u \bullet v \]
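Definition 5.1 translates directly into a recursion on the first letters of the two words. This is a minimal sketch with a hypothetical function name; it enumerates the whole language, so it is only practical on short words.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def shuffle(u, v):
    """All interleavings of u and v that keep the letter order of each word."""
    if not u:
        return frozenset({v})
    if not v:
        return frozenset({u})
    # The first letter of the result comes either from u or from v.
    from_u = {u[0] + w for w in shuffle(u[1:], v)}
    from_v = {v[0] + w for w in shuffle(u, v[1:])}
    return frozenset(from_u | from_v)


words = shuffle("abc", "def")
print(len(words))         # 20
print("adbecf" in words)  # True
print("abefcd" in words)  # False
```

With six distinct letters the shuffle of two three-letter words has $\binom{6}{3} = 20$ elements, and the call `shuffle(v, u)` returns the same set, in line with property (5.2).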

As a simplification, we will identify $u \bullet v$ and $\{u\} \bullet \{v\}$.

5.1.2 Complementary subsequences and complementary sets.

The notion of the complementary subsequence is, in some respect, the converse of the shuffle operation, and is defined as follows:

Definition 5.2 (Complementary subsequences and set.) If $x$ is a subsequence of $w$, the complementary set of $x$ with respect to $w$ is defined as:

\[ w \backslash x = \{ y \in \Sigma^*, \ \exists I = \{i_1 \ldots i_k\}, \ i_1 < \ldots < i_k, \ \text{s.t.} \ y = w_{i_1} \ldots w_{i_k} \ \text{and} \ x = w_1 \ldots w_{i_1 - 1} w_{i_1 + 1} \ldots w_{i_2 - 1} \ldots \} \]

If $x$ is not a subsequence of $w$, $w \backslash x$ is empty.

The complementary set of $x$ with respect to $w$ contains what remains of $w$ when the symbols in $x$ are removed. If $y$ is in $w \backslash x$, we will say that $y$ is a complementary subsequence of $x$ in $w$. For instance, the complementary set of false wrt. falsehood is the singleton {hood}; the set of complementary subsequences of ive wrt. derivative is {derativ, dervati, derivat}.

This operation can be turned into a symmetric binary relationship as follows:

Definition 5.3 (Complementary relationship.) Each $w \in \Sigma^*$ defines a binary relationship denoted $\backslash_w$ and defined as: $u \backslash_w v$ if and only if $u \in w \backslash v$.

The complementary set of a word $x$ wrt. a language $L$ generalizes this notion and is defined as:

\[ L \backslash x = \bigcup_{w \in L} w \backslash x \]

Similarly, one can also define the complementary set of a language $K$ wrt. a word $w$ as:

\[ w \backslash K = \bigcup_{x \in K} w \backslash x \]

This operation is finally extended to languages: if $K$ and $L$ are two languages, the complementary $L \backslash K$ of $K$ with respect to $L$ is the union over words in $L$ of their complementary set with respect to $K$:

\[ L \backslash K = \bigcup_{w \in L} w \backslash K = \bigcup_{x \in K} L \backslash x = \bigcup_{w \in L, \, x \in K} w \backslash x \]

The notions of complementary set and shuffle are related through the following property:

Property 5.1 $w \in u \bullet v \Leftrightarrow u \in w \backslash v$

We will also need the following property:
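The complementary set can be computed by the same kind of recursion as the shuffle, and the two operations combine into a brute-force equation solver, following the characterization of the solution set as the complement of $x$ in the shuffle of $y$ and $z$ (Theorem 5.4, stated in this section). The sketch below is illustrative, with hypothetical names, and is exponential in the word lengths.

```python
def shuffle(u, v):
    """All interleavings of u and v that keep the letter order of each word."""
    if not u:
        return {v}
    if not v:
        return {u}
    return {u[0] + w for w in shuffle(u[1:], v)} | {v[0] + w for w in shuffle(u, v[1:])}


def complement(w, x):
    """Complementary set of x with respect to w (empty if x is not a subsequence of w)."""
    if not x:
        return {w}
    if not w:
        return set()
    # Either the first letter of w is kept in the remainder...
    result = {w[0] + y for y in complement(w[1:], x)}
    # ...or it is consumed as the first letter of x.
    if w[0] == x[0]:
        result |= complement(w[1:], x[1:])
    return result


def solve(x, y, z):
    """All t with x : y :: z : t, as the complement of x in the shuffle of y and z."""
    solutions = set()
    for w in shuffle(y, z):
        solutions |= complement(w, x)
    return solutions


print(sorted(complement("derivative", "ive")))  # ['derativ', 'derivat', 'dervati']
print(sorted(solve("bc", "abc", "c")))          # ['ac', 'ca']
print(solve("abc", "def", "ijk"))               # set()
```

Property 5.1 can be observed on the same functions: a word `w` belongs to `shuffle(u, v)` exactly when `u` belongs to `complement(w, v)`.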

Property 5.2 $\forall u, v \in \Sigma^*$, $\forall w$ subsequence of $u$: $(u \backslash w) \bullet v \subseteq (u \bullet v) \backslash w$

The following theorem yields a formal definition of the set of solutions of an analogical equation ([37]):

Theorem 5.4 $t$ is a solution of $x : y :: z : ?$ $\Leftrightarrow$ $t \in (y \bullet z) \backslash x$

Not all analogical equations have a solution in this framework. The solution must fulfill the necessary conditions given in section 3.2. In particular, the symbol inclusion condition implies that every letter of the solution, if there is one, must appear in one of the three first terms of the equation.

5.2 Solving analogical equations in sequences using the edit distance.

We show here how to use the edit distance to define what a sequence is to another one, and to define the "as" relation of the analogy from the way the edit distance is computed. Informally, in the analogy $aaBAB : \alpha\alpha bab :: bbBAB : \beta\beta bab$, the first sequence is transformed into the second with a series of transformations of a letter $x$ into another letter $y$, denoted $S_{x \to y}$:

\[ S_{a \to \alpha} \ S_{a \to \alpha} \ S_{B \to b} \ S_{A \to a} \ S_{B \to b} \]

Similarly, the sequence $bbBAB$ is transformed into the sequence $\beta\beta bab$ with the sequence:

\[ S_{b \to \beta} \ S_{b \to \beta} \ S_{B \to b} \ S_{A \to a} \ S_{B \to b} \]

Expressing that the four sequences are in analogy is, in this simple case, merely matching the two sequences of transformations one to one from left to right, and noting that the corresponding 4-uples of letters are in analogy in the alphabet:

\[
\begin{array}{ccccc}
S_{a \to \alpha} & S_{a \to \alpha} & S_{B \to b} & S_{A \to a} & S_{B \to b} \\
S_{b \to \beta} & S_{b \to \beta} & S_{B \to b} & S_{A \to a} & S_{B \to b} \\
a : \alpha :: b : \beta & a : \alpha :: b : \beta & B : b :: B : b & A : a :: A : a & B : b :: B : b
\end{array}
\]

Actually, the edit distance can not only deal with substitutions between letters, but also with insertions and deletions of letters, as explained in the following section. This is why we have introduced a special symbol $\smile$ when defining the analogy between sequences in section 3.3.2.

5.2.1 The edit distance between sequences.

To introduce the edit distance, we have to give more definitions and quote a theorem, demonstrated by Wagner and Fischer in [33]. First, we present the notion of edition between sequences, based on three edit operations between letters: the insertion of a letter in the target sequence, the deletion of a letter in the source sequence and the substitution, which means replacing a letter in the source sequence by another letter in the target sequence. Each of these operations can be associated with some positive cost. We denote $S_{a \to b}$ the substitution from $a$ into $b$, with positive cost $\delta(a, b)$, $S_{a \to \smile}$ the deletion of $a$ with positive cost $\delta(a, \smile)$, and $S_{\smile \to a}$ the insertion of $a$ with positive cost $\delta(\smile, a)$. The cost of the edition between sequences is the sum of the costs of the operations between letters required to transform the source sequence into the target one.

We recall what an alignment is (Definition 3.3). An alignment between two words $x, y \in \Sigma^*$ is a word $z$ on the alphabet $((\Sigma \cup \{\smile\}) \times (\Sigma \cup \{\smile\})) \setminus \{(\smile, \smile)\}$ whose projection on the first component is semantically equivalent to $x$ and whose projection on the second component is semantically equivalent to $y$.

We introduce now the edit distance between sequences, and show a sufficient condition for four sequences to be in analogy by using this distance.

Definition 5.5 (Distance on augmented alphabets.) We denote now $\Sigma' = \Sigma \cup \{\smile\}$ and we say that $\delta$ is a distance on the augmented alphabet $\Sigma'$ (we keep the word "distance", since this definition is only slightly adapted from the classical one) iff:

1. $\delta$ is an application of $\Sigma' \times \Sigma'$ on $\mathbb{R}^+$, defined for every couple of elements, except for $\delta(\smile, \smile)$.
2. $\forall a, b \in \Sigma'$ for which $\delta$ is defined: $\delta(a, b) = 0 \Leftrightarrow a = b$
3. $\forall a, b \in \Sigma'$ for which $\delta$ is defined: $\delta(a, b) = \delta(b, a)$
4. $\forall a, b, c \in \Sigma'$ for which $\delta$ is defined: $\delta(a, b) \leq \delta(a, c) + \delta(c, b)$

Theorem 5.6 (Edit distance between sequences ([33]).) Let $\delta$ be a distance on $\Sigma'$, and $x$ and $y$ be sequences of $\Sigma^*$. The edit distance $D(x, y)$ is the cost of an alignment with the lowest cost that transforms $x$ into $y$.

An alignment corresponding to the edit distance, that of lowest cost, is called optimal. We denote $S(x, y)$ the sequence of transformations corresponding to the optimal alignment. (There may be several optimal alignments. We will examine this case in section 5.2.6. For the sake of simplicity, we will presently assume that there is a unique optimal alignment.) It is now possible to use the classical dynamic programming algorithm of Wagner and Fischer ([33], algorithm 1), which computes the edit distance and the optimal alignment. A consequence of this algorithm is the following remarkable result [8], which justifies the name of edit distance:

Theorem 5.7 If $\delta$ is a distance on the augmented alphabet $\Sigma'$ then $D$ is a distance on $\Sigma^*$ (in the usual sense, given in definition 2.2).

Algorithm 1 Computing the edit distance $D(x, y)$ between two sequences $x$ and $y$ in $\Sigma^*$, of lengths $m$ and $n$, with the distance $\delta$ defined on $\Sigma'$.

    begin
      $M(0, 0) \leftarrow 0$;
      for $i = 1, m$ do
        $M(i, 0) \leftarrow \sum_{k=1}^{i} \delta(x_k, \smile)$;
      end for
      for $j = 1, n$ do
        $M(0, j) \leftarrow \sum_{k=1}^{j} \delta(\smile, y_k)$;
      end for
      for $i = 1, m$ do
        for $j = 1, n$ do
          $M(i, j) \leftarrow \min \{ M(i-1, j) + \delta(x_i, \smile), \ M(i, j-1) + \delta(\smile, y_j), \ M(i-1, j-1) + \delta(x_i, y_j) \}$
        end for
      end for
      $D(x, y) \leftarrow M(m, n)$
    end

This algorithm can be completed to construct the optimal alignment between $x$ and $y$, or all the optimal alignments if there is more than one. This is done by keeping more information during the computation and by backtracking on the optimal paths in the final matrix $M$ (see [30]) computed by the algorithm.

For example, on the augmented alphabet $\Sigma' = \{\smile, a, b, c, d, e, f, g\}$, if the costs $\delta(\cdot, \cdot)$ between distinct symbols are all equal to unity, the unique optimal alignment between $x = abgef$ and $y = acde$ is given by:

\[
\begin{array}{cccccc}
x' = & a & b & g & e & f \\
y' = & a & c & d & e & \smile
\end{array}
\]

It is defined by the sequence

\[ S(x, y) = S_{a \to a} \ S_{b \to c} \ S_{g \to d} \ S_{e \to e} \ S_{f \to \smile} \]

and the edit distance between $x$ and $y$ is:

\[ D(x, y) = \delta(a, a) + \delta(b, c) + \delta(g, d) + \delta(e, e) + \delta(f, \smile) = 0 + 1 + 1 + 0 + 1 = 3 \]

5.2.2 Edit distance and analogy.

Let $\mathcal{S}$ be the set of all elements $S_{a \to b}$, where $a$ and $b$ are elements of $\Sigma'$, except that the element $S_{\smile \to \smile}$ is not in $\mathcal{S}$. Hence, $\mathcal{S}^*$ is the set of all sequences of transformations between letters of $\Sigma'$ (with the exception of $S_{\smile \to \smile}$). We can augment $\mathcal{S}$ with a new element, giving an alphabet $\mathcal{S}'$, like we have done for $\Sigma$ in section 3.3.2, to allow deletions and insertions between elements of $\mathcal{S}$.
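Algorithm 1 of section 5.2.1, completed by a backtracking pass, can be sketched as follows. The names `edit_distance` and `unit_cost` are hypothetical, `"~"` stands for the empty symbol, and the backtracking shown here returns a single optimal alignment (it prefers substitutions, then deletions, then insertions) rather than all of them.

```python
GAP = "~"  # stands for the empty symbol of the augmented alphabet


def unit_cost(a, b):
    """Illustrative cost: 0 for identical symbols, 1 otherwise."""
    return 0 if a == b else 1


def edit_distance(x, y, delta=unit_cost):
    """Wagner-Fischer dynamic programming; returns (D(x, y), one optimal alignment)."""
    m, n = len(x), len(y)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = M[i - 1][0] + delta(x[i - 1], GAP)
    for j in range(1, n + 1):
        M[0][j] = M[0][j - 1] + delta(GAP, y[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = min(
                M[i - 1][j] + delta(x[i - 1], GAP),       # deletion of x_i
                M[i][j - 1] + delta(GAP, y[j - 1]),       # insertion of y_j
                M[i - 1][j - 1] + delta(x[i - 1], y[j - 1]),  # substitution
            )
    # Backtrack from M[m][n] to M[0][0] along one optimal path.
    columns = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + delta(x[i - 1], y[j - 1]):
            columns.append((x[i - 1], y[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and M[i][j] == M[i - 1][j] + delta(x[i - 1], GAP):
            columns.append((x[i - 1], GAP))
            i -= 1
        else:
            columns.append((GAP, y[j - 1]))
            j -= 1
    columns.reverse()
    return M[m][n], columns


distance, alignment = edit_distance("abgef", "acde")
print(distance)   # 3
print(alignment)  # [('a', 'a'), ('b', 'c'), ('g', 'd'), ('e', 'e'), ('f', '~')]
```

On the example of section 5.2.1 the sketch returns the distance 3 and the alignment given in the text; each returned pair corresponds to one transformation $S_{a \to b}$ of the sequence $S(x, y)$.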