Foreword
Vincent Claveau, IRISA - CNRS, Rennes, France
In this course: a supervised symbolic machine learning technique for concept learning (i.e. 2 classes). INSA 4.
Sources of the course: slides and concepts from L. Miclet, F. Coste...

Sequences of symbols
Sequential information: genomic data (DNA, RNA, proteins), language, music, logs, electrocardiograms...
How to handle this sequential aspect in machine learning? Can we learn automatically to recognize the sequences of DNA encoding a certain physiological property?

Examples of problems expressed by sequences: back to the starting point
The sequence aababaaabb is a positive example; the sequence aababaaaba is a negative example. Can we learn automatically to distinguish the sequences leading back to the starting point from the others?

Examples of problems expressed by sequences: switching the light
Consider 2 switches I1 and I2 for one light bulb; 4 states are possible:
state 1: I1 is low and I2 is low (light is off)
state 2: I1 is high and I2 is low (light is on)
state 3: I1 is low and I2 is high (light is on)
state 4: I1 is high and I2 is high (light is on)
Action a modifies the state of I1, action b modifies the state of I2. Only state 1 is wanted (light switched off).
Sequences aa, baba and abbbba are accepted; sequences a, ab, baa or bbbbbbbbb are not. Can we learn automatically to recognize the sequences of actions leading to state 1?
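The light-switch example is easy to simulate directly. The sketch below (function names are mine, not from the course) toggles the two switches and checks whether a sequence of actions leads back to state 1.

```python
def light_state(actions):
    """Simulate the two-switch example: start in state 1 (both low);
    action 'a' toggles I1, action 'b' toggles I2."""
    i1 = i2 = 0
    for act in actions:
        if act == 'a':
            i1 ^= 1
        else:
            i2 ^= 1
    return (i1, i2)

def leads_home(actions):
    """True iff the action sequence ends in state 1 (light off)."""
    return light_state(actions) == (0, 0)
```

Running it on the sequences above confirms the classification given in the slides: aa, baba and abbbba lead home, while a, ab, baa and bbbbbbbbb do not.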
Examples of problems expressed by sequences: switching the light
Other sequences; the finite state automaton of the problem.

Basics 1: vocabulary
Word: sequence of symbols (from an alphabet Σ)
Language: set (possibly infinite) of words
Grammar: set of rules producing the words of a language
Some tools for handling sequences: grammars (rules), finite state machines (automata, transducers...), trees (prefix trees...), expressions (regular expressions), HMMs...

Basics 2: Chomsky hierarchy
One possible classification (among many others!) of languages, by increasing expressiveness:
regular grammars (type 3; rules of the form A → a and A → aB)
context-free grammars (type 2; A → γ, e.g. γ = abccbcca)
context-sensitive grammars (type 1; αAβ → αγβ)
unrestricted grammars (type 0; α → β)
Regular grammars are mastered; in particular, we know how to infer them. We know fewer things about context-free grammars, and almost nothing about context-sensitive and unrestricted grammars.

Basics 3
In this course we focus only on regular languages, and we use automata to represent and handle them.

The 4 methodological questions (cf. class 1)
1 - Describing the examples as sequences of symbols: positive examples: b, aab, aaaab; negative examples: aaab, a, aaaaa, bb
2 - Choosing the hypothesis space: hypothesis = any automaton (deterministic, DFA, or non-deterministic, NFA)
3 - Exploring the hypothesis space: exploration of a discrete space (state merging, see below)
4 - Evaluation: classically, by testing the final automaton on a test set
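To make questions 1 and 2 concrete, here is a minimal DFA sketch (the helper name is mine), together with one hypothesis consistent with the sample above. The language (aa)*b — an even number of a's followed by a single b — is my guess at a plausible target; the course does not state it.

```python
def make_dfa(transitions, start, finals):
    """Build a DFA acceptor from a (state, symbol) -> state dict;
    missing transitions lead to implicit rejection."""
    def accepts(word):
        state = start
        for sym in word:
            state = transitions.get((state, sym))
            if state is None:
                return False
        return state in finals
    return accepts

# hypothetical DFA for (aa)*b, consistent with the training sample
hyp = make_dfa({(0, 'a'): 1, (1, 'a'): 0, (0, 'b'): 2}, 0, {2})
```

This hypothesis accepts all three positive examples and rejects all four negative ones, so it has zero empirical risk on the sample.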
A closer look at the hypothesis space 1
What is in our hypothesis space? Example of automaton: in this course, we decide that non-deterministic automata are excluded.
Properties of the hypothesis space: for a finite set of examples, the hypothesis space is finite, and it can be hierarchically organized.

A closer look at the hypothesis space 2: cover relation
In grammatical inference: the hypothesis covers the example abaa.

A closer look at the hypothesis space 3: subsumption
In grammatical inference: this hypothesis also covers abaa.

A closer look at the hypothesis space 4: bounds of the hypothesis space
most specific (canonical) automaton of the training set
most specific (canonical) automaton of the positive examples
most general automaton (universal automaton, UA)

Exploring the hypothesis space
Principles: learning by exploring the discrete space of automata, searching for an automaton with an empirical risk equal to zero. Bottom-up search: start from the most specific automaton and generalize; the generalization operator is state merging.
About merging: choose 2 states and merge them to generalize; a cascade of forced mergings may be needed to keep the automaton deterministic; the merging is controlled (stopped) using the negative examples.
Exploring the hypothesis space: examples of merging
NB: merging may produce non-deterministic automata (example to be done in course).

Avoiding over-generalization
A criterion to stop the merging is needed. Examples of such criteria:
limitation to a certain sub-family of automata
statistical criterion (the remaining states are considered too different to be merged)
use of negative examples: stop when one negative example is accepted by the automaton

Theoretical and practical problems: open issues
Why start from the canonical automaton and explore by merging?
When is the training set large enough to be sure to find the right hypothesis?
How to choose the states to be merged?
Can we accept an empirical risk greater than 0?
Can we generalize to more complex concepts: stochastic automata, transducers, context-free grammars?

Finite state automata
Quintuple (Q, Σ, δ, Q0, F):
Q: finite set of states
Σ: finite alphabet
δ: transition function Q × Σ → 2^Q
Q0 ⊆ Q: set of initial states
F ⊆ Q: set of final (or accepting) states

Deterministic automaton, complete automaton
If, for all q ∈ Q and a ∈ Σ, δ(q, a) contains at most one element (resp. exactly one element), and if |Q0| = 1, the automaton is said to be deterministic (DFA) (resp. complete).
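The quintuple definition translates directly into code. In the sketch below (naming is mine), δ is encoded as a dict mapping (state, symbol) to a set of successor states:

```python
def is_deterministic(states, alphabet, delta, initials):
    """At most one successor per (state, symbol) and a single initial state."""
    return len(initials) == 1 and all(
        len(delta.get((q, a), set())) <= 1 for q in states for a in alphabet)

def is_complete(states, alphabet, delta, initials):
    """Exactly one successor per (state, symbol) and a single initial state."""
    return len(initials) == 1 and all(
        len(delta.get((q, a), set())) == 1 for q in states for a in alphabet)
```

A complete automaton is thus always deterministic, but not conversely: a DFA may simply have no transition for some (state, symbol) pair.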
What can be said about this automaton?

Cover relation
An automaton (deterministic or not) covers (accepts) a word u = a1...aj if there exists a sequence (unique or not) of j + 1 states (q0, ..., qj) s.t. q0 ∈ Q0, qj ∈ F, and for all 0 ≤ i ≤ j − 1, q_{i+1} ∈ δ(q_i, a_{i+1}). The j + 1 states are said to be reached by this acceptation, qj being the accepting state, and the j transitions are said to be used by this acceptation.

Accepted language
The language L(A) accepted by an automaton A is the set of all the sequences accepted by A.

Partitions
A partition π of a set S is a set of non-empty, pairwise disjoint subsets of S whose union is S.
If s ∈ S, the unique element (block) of π containing s is written B(s, π).
A partition πi refines (is finer than) a partition πj iff every block of πj is a block of πi or the union of several blocks of πi.

Examples of partitions
Consider an automaton containing 5 states: 0, 1, 2, 3, 4.
π2 = {{0, 1}, {2}, {3, 4}} is a possible partition
π3 = {{0, 1, 2}, {3, 4}} is coarser (less fine) than π2
π4 = {{0}, {1, 3}, {2, 4}} is neither finer nor coarser than π2 (the two are not comparable)
B(0, π2) = {0, 1} (block containing state 0 in π2)
B(0, π3) = {0, 1, 2} (block containing state 0 in π3)
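The block function B(s, π) and the refinement relation can be checked mechanically. A small sketch (assuming partitions are represented as lists of sets; names are mine), tried on the three example partitions above:

```python
def block(s, pi):
    """B(s, pi): the unique block of partition pi containing s."""
    return next(b for b in pi if s in b)

def refines(pi_i, pi_j):
    """True iff pi_i is finer than pi_j: every block of pi_i is contained in
    some block of pi_j (so every block of pi_j is a union of blocks of pi_i)."""
    return all(any(b <= c for c in pi_j) for b in pi_i)

pi2 = [{0, 1}, {2}, {3, 4}]
pi3 = [{0, 1, 2}, {3, 4}]
pi4 = [{0}, {1, 3}, {2, 4}]
```

As stated above, π2 refines π3, π3 does not refine π2, and π4 is incomparable with π2 (neither refines the other).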
Derived (quotient) automata
Let A = (Q, Σ, δ, Q0, F) be an automaton. The automaton derived from A w.r.t. a partition π of Q, written A/π = (Q', Σ, δ', Q0', F'), is defined by:
Q' = Q/π = {B(q, π) | q ∈ Q}
Q0' = {B ∈ Q' | B ∩ Q0 ≠ ∅}
F' = {B ∈ Q' | B ∩ F ≠ ∅}
δ': Q' × Σ → 2^{Q'} with, for all B, B' ∈ Q' and a ∈ Σ: B' ∈ δ'(B, a) iff there exist q, q' ∈ Q, q ∈ B, q' ∈ B', such that q' ∈ δ(q, a)
The states of Q belonging to the same block B of the partition π are said to be merged.

Derived automata: example
Consider the automaton A1; compute A2 = A1/π2 with π2 = {{0, 1}, {2}, {3, 4}}.
Automaton A1

Major property of merging
If an automaton A/πj derives from an automaton A/πi, then the language accepted by A/πi is included in the one accepted by A/πj. Thus A/πj recognizes all the words accepted by A/πi, plus possibly other words: A/πj is more general than A/πi. More formally, the merging operation induces a partial order on the derived automata.

Practical consequence
Starting from an automaton A, it is possible to build every automaton derived from A from the partitions of A's states, and there exists a partial order relation on this set, consistent with the inclusion of the languages recognized by these automata.

Major property of merging: examples
Back to example A1: we've seen that choosing the partition π2 = {{0, 1}, {2}, {3, 4}} makes it possible to derive the quotient automaton A2 = A1/π2; thus, we know that L(A1) ⊆ L(A2).

Exercise
Compute A3 = A1/π3 (π3 = {{0, 1, 2}, {3, 4}}); what can you say about it?
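The derivation A/π can be written as a short function. Since A1's transition table lives in a figure rather than the text, the 5-state automaton below is an invented example; it only illustrates the major property above, namely that the quotient accepts everything the original does, and usually more.

```python
def derive(delta, finals, blocks):
    """Quotient automaton A/pi. delta: (state, symbol) -> set of successors;
    blocks: list of frozensets partitioning the state set."""
    block_of = {q: b for b in blocks for q in b}
    ndelta, nfinals = {}, {block_of[q] for q in finals}
    for (q, a), targets in delta.items():
        for q2 in targets:
            ndelta.setdefault((block_of[q], a), set()).add(block_of[q2])
    return ndelta, nfinals

def accepts(delta, finals, starts, word):
    """NFA acceptance by subset simulation."""
    current = set(starts)
    for a in word:
        current = {q2 for q in current for q2 in delta.get((q, a), set())}
    return bool(current & finals)

# invented 5-state automaton accepting only the word "abab"
delta = {(0, 'a'): {1}, (1, 'b'): {2}, (2, 'a'): {3}, (3, 'b'): {4}}
pi2 = [frozenset({0, 1}), frozenset({2}), frozenset({3, 4})]
qdelta, qfinals = derive(delta, {4}, pi2)
```

Here the quotient still accepts "abab" but now also accepts "aba", which the original rejects: L(A) ⊆ L(A/π), with strict inclusion in this case.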
Compute A3 = A1/π3 (π3 = {{0, 1, 2}, {3, 4}}); what can you say about it?
Partition π3 is coarser than π2, since its blocks are built as unions of blocks of π2; thus, we know that L(A2) ⊆ L(A3).

Exercise
Compute A4 = A1/π4 (π4 = {{0}, {1, 3}, {2, 4}}); what can you say about it?

Hypothesis space: the space E_H and merging
The set of automata derived from an automaton A is partially ordered by the subsumption relation given by the derivation; thus, E_H is a lattice. The automaton A is the most specific element (bottom); the universal automaton UA is the most general element (top). There are as many elements in E_H as there are partitions of the states of A, and the more we merge states, the more the accepted language grows.

Structural completeness
Language samples: a positive sample E+ is a finite subset of a language L; a negative sample E− is a finite subset of the complement language Σ* \ L.
Structural completeness: E+ is structurally complete w.r.t. a deterministic automaton A accepting L if every transition of A has been used and every element of F (the final states of A) is used as an accepting state. It implements an inductive bias.

Exercise
Give several DFAs such that E+ = {aab, ab, abbbbb} is structurally complete for them.
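Structural completeness is mechanical to check for a DFA. A sketch (the helper name and the example DFA are mine), tried on the exercise sample E+ = {aab, ab, abbbbb}:

```python
def structurally_complete(sample, delta, start, finals):
    """E+ is structurally complete wrt a DFA iff every transition is used
    and every final state is reached as the accepting state of some word."""
    used, accepting = set(), set()
    for w in sample:
        q = start
        for a in w:
            used.add((q, a))
            q = delta[(q, a)]   # raises KeyError if a word is not accepted
        accepting.add(q)
    return used == set(delta) and accepting == set(finals)

# one possible answer to the exercise: a DFA accepting a a* b b*,
# for which the sample {aab, ab, abbbbb} is structurally complete
delta = {(0, 'a'): 1, (1, 'a'): 1, (1, 'b'): 2, (2, 'b'): 2}
```

Note that structural completeness depends on the sample, not only on the language: removing aab from the sample leaves the transition (1, 'a') unused, and completeness is lost.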
Canonical automata 1: maximal canonical automaton of E+ (MCA)
The biggest automaton (in number of states) such that E+ is structurally complete for it, written MCA(E+) = (Q, Σ, δ, Q0, F); it is generally non-deterministic (because |Q0| > 1). Example: MCA({a, ab, bab}).

Canonical automata 2: prefix tree acceptor of E+ (PTA)
The quotient automaton MCA(E+)/π_{E+}, written PTA(E+), defined by: B(q, π_{E+}) = B(q', π_{E+}) iff Pr(q) = Pr(q'), where Pr(q) is the prefix read to reach state q. PTA(E+) is obtained by merging the states of MCA(E+) sharing the same prefix; by construction, it is deterministic. Example on the previous sample.

Maximal generalization 1
Goal of the exploration: find the minimal automaton that does not cover any negative example. The border set (in dash) marks the limit of negative example acceptation.

Maximal generalization 2: border set
The border (frontier) set BS_MCA(E+, E−) is an antichain in which each element is at a maximal depth in E_H (the space built from MCA(E+)). An antichain (fr: antichaîne) is a subset s.t. no pair of elements is in the order relation (they are not comparable). BS_PTA(E+, E−) contains the canonical automaton A(L) of any regular language L for which E+ is a positive sample and E− a negative one.

Maximal generalization 3-4: consequences
The border set of the lattice built from MCA(E+) is the set of the most general automata compatible with E+ and E−. The problem of finding the smallest DFA compatible with E+ and E− is thus equivalent to finding the smallest DFA in the border set built from PTA(E+).
Let's consider E+ = {b, ab} and E− = {bb}; the maximal canonical automaton of E+ is:
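The two constructions can be sketched as follows (the representation is mine: MCA states are (word index, position) pairs, PTA states are prefixes). On the sample {a, ab, bab}, the MCA has 2 + 3 + 4 = 9 states and one initial state per word, while the PTA collapses shared prefixes down to 6 states and is deterministic.

```python
def mca(sample):
    """Maximal canonical automaton: one disjoint path per word of E+,
    with one initial state per word (hence |Q0| > 1 in general)."""
    states, initials, finals, delta = [], [], [], []
    for n, w in enumerate(sample):
        path = [(n, k) for k in range(len(w) + 1)]
        states += path
        initials.append(path[0])
        finals.append(path[-1])
        delta += [((n, k), w[k], (n, k + 1)) for k in range(len(w))]
    return states, initials, finals, delta

def pta(sample):
    """Prefix tree acceptor: states of the MCA sharing a prefix are merged,
    so each state is identified with a prefix of E+."""
    states = sorted({w[:k] for w in sample for k in range(len(w) + 1)},
                    key=lambda s: (len(s), s))
    delta = [(p[:-1], p[-1], p) for p in states if p]
    return states, [""], sorted(sample), delta
```

The determinism of the PTA follows from the fact that each prefix has at most one extension per symbol in the tree.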
Maximal generalization 5
Automata in the border set BS_PTA(E+, E−) with E+ = {b, ab} and E− = {bb}.

Back on the hypothesis space 1: fundamental properties
General case: let E+ be a sample of a regular language L, and A any automaton recognizing exactly L. If E+ is structurally complete w.r.t. A, then A ∈ E_H (the space built from MCA(E+)); conversely, if A ∈ E_H (built from MCA(E+)), then E+ is structurally complete w.r.t. A.
Deterministic case: let E+ be a sample of a regular language L, and A(L) the canonical automaton recognizing L. If E+ is structurally complete w.r.t. A(L), then A(L) ∈ E_H (built from PTA(E+)).

Back on the hypothesis space 2: size of E_H
Let E+ be a sample of an unknown language L, structurally complete for an automaton A accepting exactly L. Then A can be derived from a partition π of the states of MCA(E+), i.e. regular inference = finding the partition π. Thus, the size of E_H is the number of partitions P(N), with N the number of states of MCA(E+) or of PTA(E+). For example, P(10) ≈ 10^5, P(20) ≈ 5 × 10^13, P(30) ≈ 8.5 × 10^23... This number grows exponentially, so we need a clever exploration, guided by heuristics.

RPNI algorithm: principles
RPNI implements a depth-first search in the space E_H built upon PTA(E+), and finds a local optimum to the problem of the smallest DFA. By construction, every state of PTA(E+) corresponds to a unique prefix, and these prefixes can be sorted by length and lexicographic order (ɛ, a, b, aa, ab, ba, bb, aaa, aab...).
RPNI process
N − 1 steps, where N is the number of states in PTA(E+). The partition at step i is obtained by merging the two first blocks (w.r.t. the length and lexicographic order above) of the partition of step i − 1 that yield a compatible quotient automaton.

RPNI algorithm
Input: E+, E−. Output: a partition of PTA(E+) corresponding to a DFA compatible with E+ and E−.

    π ← {{0}, {1}, ..., {N − 1}}          // N = number of states in PTA(E+)
    A ← PTA(E+)
    for i = 1 to N − 1 do
        for j = 0 to i − 1 do
            π′ ← π \ {Bj, Bi} ∪ {Bi ∪ Bj}  // merging of blocks/states Bi and Bj
            if A/π′ does not accept any element of E− then
                π′ ← determ_fusion(A/π′)   // cascade of merges to restore determinism
                π ← π′
            end if
        end for
    end for
    return A ← A/π
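A compact executable sketch of RPNI (all names are mine; the determinization merges are cascaded before the compatibility test, which plays the role of the determ_fusion step in the pseudocode). It is exercised on the step-by-step example used later in the course: E+ = {ɛ, ab, aaa, aabaa, aaaba}, E− = {aa, baa, aaab}.

```python
def build_pta(positives):
    """PTA(E+): states are the prefixes of E+, in length-lex order."""
    states = sorted({w[:k] for w in positives for k in range(len(w) + 1)},
                    key=lambda s: (len(s), s))
    delta = {(p[:-1], p[-1]): p for p in states if p}
    return states, delta, set(positives)

def force_merge(delta, part, keep, absorb):
    """Merge block `absorb` into `keep`, then cascade further merges
    until the quotient automaton is deterministic again."""
    part = {q: (keep if b == absorb else b) for q, b in part.items()}
    while True:
        seen, clash = {}, None
        for (q, a), q2 in delta.items():
            key, target = (part[q], a), part[q2]
            if key in seen and seen[key] != target:
                clash = (seen[key], target)
                break
            seen[key] = target
        if clash is None:
            return part
        part = {q: (clash[0] if b == clash[1] else b) for q, b in part.items()}

def quotient_accepts(delta, finals, part, word):
    """Does the quotient DFA accept `word`? (initial state = block of the
    empty prefix)"""
    trans = {(part[q], a): part[q2] for (q, a), q2 in delta.items()}
    b, final_blocks = part[""], {part[q] for q in finals}
    for a in word:
        if (b, a) not in trans:
            return False
        b = trans[(b, a)]
    return b in final_blocks

def rpni(positives, negatives):
    """Greedy state-merging over PTA(E+), controlled by E-."""
    states, delta, finals = build_pta(positives)
    part = {q: i for i, q in enumerate(states)}   # trivial partition
    for i in range(1, len(states)):
        for j in range(i):
            bi, bj = part[states[i]], part[states[j]]
            if bi == bj:
                continue                          # already in the same block
            candidate = force_merge(delta, part, bj, bi)
            # keep the merge only if no negative example becomes accepted
            if not any(quotient_accepts(delta, finals, candidate, w)
                       for w in negatives):
                part = candidate
                break
    return states, delta, finals, part
```

On the course example, the returned partition yields a DFA that accepts every positive example (merging can only generalize the PTA) and rejects every negative one.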
RPNI algorithm: convergence
RPNI outputs a DFA belonging to BS_PTA(E+, E−); it is the canonical automaton of the accepted language. It is the smallest compatible DFA only if the training data satisfy an additional condition: they must contain a characteristic sample. In other words, when the training data are representative enough of the language, the discovery of the canonical automaton of this language is guaranteed; moreover, this automaton is the smallest compatible DFA in that case.

Convergence (continued)
The size of the characteristic sample for this particular algorithm is O(n^2), where n is the number of states of the resulting automaton. The complexity of RPNI, in the latest published version, is O((|E+| + |E−|) · |E+|^2). If the training sample contains every word of length < 2n − 1, then identification is guaranteed. Yet this property is fragile: if the training set contains every word except a small part of the characteristic sample, identification is no longer guaranteed.

RPNI step by step
Initial data: let E+ = {ɛ, ab, aaa, aabaa, aaaba} and E− = {aa, baa, aaab}; apply the RPNI algorithm. The PTA(E+)...
Start: RPNI begins by merging 2 states; without any other information, states 0 and 1 are chosen.
End: back to the starting point, merging 0 and 3. The resulting DFA is compatible with E+ and E−; it is on the border set, so we keep this solution.
Conclusion: a real example
Genomic example: searching for the grammar defining a promoter of B. subtilis.
|E+| = 131, |E−| = 55,062; bottom of the lattice: 1,248,616 states (PTA or MCA?).
Solution found: 95 states, 347 transitions, several hours of computing.
Compactness? Readability?